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ABSTRACT 

In this thesis a new algorithm is proposed for detecting outliers in large and 
very large data sets. This procedure uses Tukey's biweight function to assign 
weights to data values in each dimension, then reassigns a weight of one to those 
points with weight above a certain cutoff value and zero to those below. The 
sample mean and covariance can be efficiently calculated over those observations 
with weight equal to one, leading to robust Mahalanobis distances for all the 
observations. We estimate the density of these Mahalanobis distances and deter- 
mine a rejection point where the slope of the density is sufficiently close to zero. 
All observations with Mahalanobis distance greater than this rejection point are 
declared outliers. This algorithm demonstrates very good outlier identification 
properties, especially on non-normal data, an area where most existing methods 
do not perform as well. We successfully model and predict its error rates for a 
mixture of chi-squared distributions and various parameters such as amount of 
contamination, separation of chi-squared distributions, and data dimension. It is 
computationally fast, and is very capable of analyzing data sets containing both 
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large number of observations and high dimensions. To analyze performance of 
the robust estimation procedure from a theoretical perspective, we also calculate 
several important asymptotic robustness properties at the standard Normal dis- 
tribution. Among others, this estimator possesses a low gross error sensitivity, 
a high relative efficiency, and a high breakdown point, as well as considerable 
flexibility that enables it to adapt to a variety of data situations and contami- 
nation levels. Compared to other methods that we have examined, this outlier 
algorithm is considerably faster, and with special reference to non-normal data, 
offers excellent all-round performance. 

This abstract accurately represents the content of the candidate's thesis. I 
recommend its publication. 

Signed 

Karen Kafadar 
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1. Introduction 

Awareness of outliers has existed for at least several hundred years. In 1620, 
Francis Bacon [4] wrote, 

Whoever knows the ways of Nature will more easily notice her de- 
viations; and, on the other hand, whoever knows her deviations will 
more accurately describe her ways. 

Even as early as the first pubhshed work on least squares in 1805, Legendre 
[48] explicitly mentioned rejection of outliers to improve accuracy and reduce 
error: 

If among these errors are some which appear too large to be ad- 
missible, then those equations which produced these errors will be 
rejected, as coming from too faulty experiments, and the unknowns 
will be determined by means of the other equations, which will then 
give much smaller errors. 

Building on this idea, in 1887 Edgeworth [20] elegantly stated: 

The method of Least Squares is seen to be our best course when 
we have thrown overboard a certain portion of our data — a sort 
of sacrifice which often has to be made by those who sail upon the 
stormy seas of Probabihty. 
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The outlier challenge is one of the earUest of statistical interests, and since 
nearly all data sets contain outliers of varying percentages, it continues to be 
one of the most important. Sometimes outliers can grossly distort the statistical 
analysis, at other times their influence may not be as noticeable. Statisticians 
have accordingly developed numerous algorithms for the detection and treatment 
of outliers, but most of these methods were developed for univariate data sets. 
This thesis focuses instead on multivariate outlier detection. 

Especially when using some of the common summary statistics such as the 
sample mean and variance, outliers can cause the analyst to reach a conclusion 
totally opposite to the case if outliers weren't present. For example, a hypothesis 
might or might not be declared significant due to a handful of extreme outliers. 
In fitting a regression line via least squares, outliers can sufficiently alter the 
slope so as to induce a sign change. For a graphical illustration of the latter, see 
[70, p. 5]. 

Sometimes outliers can draw attention to important facts, or lead to new 
discoveries, e.g. Rayleigh's discovery of argon resulting from unexpected dif- 
ferences in measured weights of nitrogen [82, pp. 49-53]. During one of my 
consulting projects, 1 was asked to evaluate patient satisfaction across various 
categories (quality of physician care, attentiveness of nurses, etc.) from different 
hospitals. One of the hospitals consistently received the highest score, in terms 
of patient satisfaction, across all categories except for smoothness of billing pro- 
cess. Since it was almost flawless in every other category, I thought there had 
to be a mistake in the data. After discussing this peculiarity with the project 
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coordinator, I learned that this hospital had recently switched to a different ac- 
counting package, which led to many incorrect accounts and frustration among 
the patients. With reference to the distribution of patient satisfaction indices, 
the billing score from this hospital was clearly an outlier — it was an extremely 
unusual data point. We certainly did not want to discard this point; on the con- 
trary, it highlighted a very important fact — this hospital needed to significantly 
improve its billing process, or risk losing business. Often it will be important to 
calculate how unlikely a particular observation is, but in this case simple visual 
inspection was sufficient. It was certainly beneficial to become aware of this 
outlier. The hospital took appropriate action and when I analyzed the following 
year's data, their billing score was on par with the rest of their scores. 

Another example where outliers themselves are of primary importance in- 
volves air safety, as discussed in Kafadar and Morris [46]. On one of the four 
hijacked flights on September 11, 5 of the 80 passengers satisfled the follow- 
ing criteria: they were not U.S. citizens, they had connections to a particular 
foreign country, had purchased one-way tickets at the gate, had paid in cash, 
and none had checked luggage. One or two passengers satisfying these criteria 
might not be too rare, but five passengers on the same flight, especially with 
the beneflts of hindsight, would now seem to be highly unusual. We of course 
fully recognize the fact that before September 11, these passengers might not 
have raised sufficient concern to spur further investigation because this type of 
attack had rarely been encountered before. Some level of concern was raised by 
an automated screening process, but the recommended action was to "re-screen 
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the checked bags" . Both the identification of outhers and the appropriate action 
are therefore important. It would be ideal to research and identify criteria that 
potentially troublesome passengers might possess; a successful (and fast) outlier 
detection algorithm is imperative to correctly identify outliers amongst large 
masses of regular passenger data, so that a human agent can be alerted to the 
possibility of trouble and investigate the matter in person. Further applications 
of outlier identification to homeland security are described in an article by Banks 
[5] . Our goal in this thesis is to identify the outliers in a data set as successfully 
as possible, after which the analyst can decide what to do with them. 

1.1 Univariate Outliers 

In univariate data, the concept of outlier seems relatively simple to define. 

Outliers are those points located "far away" from the majority of the data; they 
"probably do not" follow the assumed model. A simple plot of the data, such 
as scatterplot, stem-and-leaf plot, QQ-plot, etc., can often reveal which points 
are outliers. This is sometimes called the "interoccular test" because it hits you 
between the eyes. 

Relying solely on non- visual means, however, even univariate outliers might 
not be as easy to identify as would appear at first sight. Consider a simple 
situation: identifying outliers in a normally distributed data set. A naive outlier 
identification rule might be to calculate the sample mean and standard deviation, 
and classify as outliers all points further than 2 or 3 standard deviations away 
from the mean. For Gaussian data, this would label as outliers those points 
far enough from the main data that we think there is only a 4.56% or 0.26% 
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chance, respectively, that they belong to this distribution. For n truly Gaussian 

observations, this would also cause 0.0456n and 0.0026n regular data points to be 

labeled as outliers — false positives. Conversely, outliers that an algorithm fails 

to detect are referred to as false negatives. It is an unfortunate fact, however, 

that the presence of two or more outliers could render some or most of the outliers 

invisible to this method. If there is one or more distant outlier and one or more 

not so distant outlier in the same direction, the more distant outlier(s) could 

significantly shift the mean in that direction, and also increase the standard 

deviation, to such an extent that the lesser outlier(s) falls less than 2 or 3 

standard deviations from the sample mean, and goes undetected. This is called 

the masking effect., and results in this particular method and all related methods 

being unsuitable for use as outlier identification techniques. We illustrate with 

a simple example, borrowed from Becker and Gather [8] . 

Consider a data set of 20 observations taken from a A^(0, 1) distribution: 
-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18, 0.30, 

0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.1, 17.6, 
where the latter two observations were originally 0.81 and 1.76, but the decimal 

points were entered at the wrong place. It seems clear that these 2 observations 

should be labeled as outhers; let us apply the above method. The mean of this 

data set is 1.27 while the standard deviation is 4.35. Two standard deviations 

from the mean, towards the right, would be 9.97, while three standard deviations 

would be 14.32. Both criteria regard the point, 8.1, as expected with reasonable 

probability and do not consider it an outlier. Additionally, the three standard 

deviation boundary for detecting outliers seems rather extreme — for a A'"(0, 1) 
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data set, surely a point wouldn't have to be as large as 14.32 to be classified as an 
outlier. The masking effect occurs quite commonly in practice and we conclude 
that outlier methods based on classical statistics are unsuitable for general use, 
particularly in situations requiring non-visual techniques such as multivariate 
data. It is worth noting, however, that if instead of the sample mean and 
standard deviation, robust estimates of location and scale were used (such as 
the sample median, and median absolute deviation, MAD), both outliers would 
be detected without difficulty. 

If it is known that at most one outlier is present, the naive method would 
find it. An improved version of this rule forms the basis of Grubbs' test for a 
single outlier [29], which measures how many standard deviations away from the 
mean the smallest and largest observations arc located, and can be extended to 
Grubbs' test for k outliers, where k has to be known in advance. Thode [37] 
compares Grubbs' and a variety of other outlier tests, including those that do 
not require k to be specified in advance, but we believe that better methods 
exist for both univariate and multivariate outliers. 

Related to the masking effect is the concept of breakdown point of an esti- 
mator. Intuitively, the breakdown point is the least amount of contamination 
that could change the estimator so much that it fails to meaningfully describe 
the data set. As originally defined, the breakdown point e* depends on the 
distance between two distributions (e.g. the Prohorov or Kolmogorov distance, 
or total variation distance [35, p. 97]), but an empirically more useful version 
is the finite-sample breakdown point e* , formally defined with reference to the 
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estimator = T{xi, . . . , as 

e;(r„; xi,..., Xn) := -max{m; maxi,,...,i„supj^.^ ...^^.^ \Tn{zi, . . . , < oo}, 

where the sample {zi,...,Zn) is obtained by replacing the m data points 
^ii) ■ ■ ■ ) ^im with arbitrary values yi^, . . . , yi^. In most cases, lim^^oo — 
The highest possible value of the breakdown point is 50%, because if more than 
half the data were outliers, it would be unclear which data were from the main 
distribution and which were outliers. 

Neither the sample mean nor the sample variance has a high breakdown 
point. In fact, since it takes only one extremely distant point to shift or in- 
crease these respective estimators so far as to be essentially useless, both have a 
breakdown point of 0 for the finite-sample breakdown point). On the other 
hand, the median has a breakdown point of 50%, because half of the observa- 
tions (strategically selected to be as malicious as possible) need to be moved 
arbitrarily far away before the median becomes unusable. 

1.2 MuItivEiriate Outliers 

Multivariate outliers pose challenges that do not occur with univariate data. 
For instance, visual methods simply do not work. Even plotting the data in 
bivariate form with a systematic rotation of coordinate pairs will not help. It 
is possible (and occurs frequently in practice) that points which are outliers 
in bivariate space, are not outliers in either of the two univariate subsets — 
see Figure 1.1. Generalization to higher dimensions leads to the fact that a 
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multivariate outlier does not have to be an outlier in any of its univariate or 
bivariate coordinates, at least not without some kind of transformation, such as 
a 45° rotation of the axes in Figure 1.1. Gnanadcsikan and Kettcnring [28] state 
that visual detection of multivariate outliers is virtually impossible because the 
outhers do not "stick out on the end." 

A successful method of identifying outliers in all multivariate situations 
would be ideal, but is unrealistic. By "successful" , we mean both highly sensi- 
tive, the ability to detect genuine outliers, and highly specific, the ability to not 
mistake regular points for outliers. In the words of Gnanadcsikan and Ketten- 
ring [28], "it would be fruitless to search for a truly omnibus outlier detection 
procedure." Thus we introduce the problem of multivariate outlier detection. 

Intuitively, we label a point an outlier because it is sufficiently "far away" 
from the majority of the data. An important tool for quantifying "far away" , is 
the Mahalanobis distance, defined as 

MA ^^{^i- T(X))^C(X)-^(x, - T(X)) 

for each point Xj, where T(X) is the coordinate- wise sample mean of the data 
set X, and C(X) is the sample covariance matrix. Cleary, the Mahalanobis 
distance rehes on classical location and scale estimators. As such, it is subject 
to the masking effect, and is not suitable for general use in contaminated data. 

There are several qualities we would like an estimator to possess. We would 
like it to have a high breakdown point, but this could be outweighed by other 
criteria. (If we were forced to use an estimator with breakdown point near 50%, 
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Figure 1.1: Bivariate outliers that are not univariate outliers 
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we should probably first do something else to clean the data.) For realistic 
applications, a breakdown point of 20% is usually satisfactory. 

Another desirable property of an estimator is affine equivariance. A location 
estimator T„ e 3?*' is affine equivariant if and only if for any vector h eW and 
any nonsingular p x p matrix A, 

T„(AX + b) = AT„(X) + b. 

A scale estimator e PDS{p) (the set of positive-definite symmetric p x p 
matrices) is affine equivariant if and only if for any vector b e 3?^ and any 
nonsingular p x p matrix A, 

C„(AX + b)=AC„(X)A^. 

If an estimator is affine equivariant, stretching or rotating the data won't 
affect the estimator. Dropping this requirement greatly increases the number of 
available estimators, and in many cases, non- affine equivariant estimators have 
superior performance to affine equivariant estimators. 

A further important and desirable feature of an estimator is a computation- 
ally fast algorithm to compute it. It is not uncommon to measure data set sizes 
in terabytes or petabytes, and some real time applications require the outliers 
to be detected within seconds or at most a few minutes (as in the case of air 
passenger check- in). In these situations, slow algorithms such as the Minimum 
Volume Ellipsoid (discussed on p. 28) are essentially useless, that is, algorithms 
that are theoretically capable of locating the outliers, but would take millennia 



10 



to do so. 

As previously mentioned, outlier detection in all data situations by the same 
algorithm is not possible, but we can still aim for a reasonable diversity. This is 
called the breadth of algorithm, and we therefore strive to develop an algorithm 
with a large breadth. 

No estimator possesses all of these properties — high breakdown point, 
affine equivariance, computational speed, breadth. Rocke and Woodruff [68] 
stated that it is entirely appropriate to develop special methods to handle special 
cases. For large multivariate data sets, computational speed seems to be one of 
the most difficult goals to achieve. 

In this thesis, I examine some of the leading outlier detection methods, and 
also propose a new algorithm, comparing and evaluating its success and speed 
in a variety of situations, and investigate some useful numerical properties of its 
robust estimator. Specifically, in chapter two, I discuss existing outlier detection 
methods, pointing out strengths and weaknesses and identifying areas where 
improvements are needed. In chapter three, I propose a new algorithm that has 
several significant advantages. In chapter four, I compare the performance of the 
proposed algorithm to existing methods in a variety of data situations, and in 
chapter five, I perform a theoretical analysis of the robust estimation procedure 
and discuss avenues for future research and possible improvement. 
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2. Previous Outlier Identification Methods 

Outlier identification and robust location/scale estimation are closely re- 
lated. Recall that the classical Mahalanobis distance has a breakdown point of 
0. To increase this breakdown point, Campbell [11] proposed the use of robust 
estimators, and specifically, a class called M-estimators, for T(X) and C(X). 
This step marked an important improvement in identifying outliers, with the ro- 
bust version not suffering from the masking effect like classical version. Clearly, 
the performance of this robust version depends on the specific T(X) and C(X). 
As initially introduced, M-estimators were monotone functions and were shown 
by Maronna [52] to have a breakdown point no larger than This is ob- 

viously an improvement over the classical version, but still quite inadequate 
in higher dimensions — with p — 20, the breakdown point is less than 5%. 
Later, more sophisticated M-estimators were developed with breakdown point 
approaching 50%. Nevertheless, using the Mahalanobis distance with robust 
estimators of location and scale was an important conceptual breakthrough and 
led to significantly improved versions of the Mahalanobis distance. Obtaining 
robust location and scale estimates remains one of the most difficult challenges 
in outlier identification; without which, all distance-based methods would fail 
(this includes the majority of outlier identification algorithms.) 

Using a robust estimate of the location and scale of the primary distribution, 
the modified Mahalanobis distance can provide a reasonably good estimate of 
how far a particular observation Xj is from the center of this distribution. A suf- 
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ficiently large value of this distance indicates that the observation Xj is likely an 
outher. In 1996, Rocke and Woodruff [66] stated that all known affine equivari- 
ant methods of solving the outlier identification problem consist of the following 
two stages: 

• Phase I: Estimate a location and scale. 

• Phase II: Cahbrate the scale estimate so that it can be used to suggest 
which points are far enough from the location estimate to be considered 
outliers. 

Several methods yield acceptably accurate estimates of the mean, but per- 
formance regarding robust estimation of the covariance matrix varies widely. 
Additionally, the Mahalanobis distance is much more sensitive to small changes 
(errors) in the covariance estimate C(X) than it is to changes in the mean 
estimate T(X). 

Quantifying the magnitude of Mahalanobis distance at which an observation 
is labeled an outlier is another critical element of these algorithms. In some sit- 
uations, such as normally distributed data, the distribution of squared classical 
Mahalanobis distances follows an F distribution, as shown in [43, ch. 5] and [51, 
pp. 73-75], but not in all situations. Hadi et al. [32] develop robust location 
and scale estimates and accordingly robust Mahalanobis distances for all the 
observations, classifying those points as outliers which have robust Mahalanobis 
distance larger than c^prXfi-ai »> where Cnpr is a correction factor based on n, 
p and r (the number of points on which the location and scale estimates were 
based). We address this issue in detail in section 3.2.2; our study reveals that 
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while this correction factor works well for the classic Mahalanobis distance on 
normally distributed data, it can be quite inaccurate (too high) for modified 
Mahalanobis distances from non-normal data and fails to detect the majority of 
outliers even when an apparently clear separation exists between both groups. 

2.1 The Influence Function 

For a given sample of size n, Hoaghn et al. [38] (and to a lesser extent, 
Thode [37]) discuss the sensitivity curve 5'C(a;|T„), which is computed by cal- 
culating the estimator with and without an observation at the point x, and 
is proportionate to the size of the sample: 

SC{x\Tn) = n(r„(xi, . . . , Xn-l, x) - T„_i(xi, . . . , Xn-\)). 

The sensitivity curve describes the effect of an individual observation on the 
estimator Tn for a specific data set. It is not hard to compute empirically, but 
is not of much theoretical use. When we let the sample size n tend towards 
infinity, the resulting limit is called the influence function. In the absence of 
a sample data set, the influence function is defined with reference to a specific 
distribution., F. 

Hampel [33] introduced the concept of the influence function which he and 
his collaborators popularized in a book [35]. Although the influence function 
approach has been used primarily (but not exclusively) on univariate data, our 
experience has been that it can be advantageous from a computational point of 
view to examine each component separately. We can then combine the respec- 
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tive estimates to calculate the Mahalanobis distances and locate the outUers in 
multivariate space. 

Suppose we have a univariate sample {xi, . . . , Xn) with empirical distribution 
Fn- Letting A^. be the point mass 1 in X, we can define F„ by F„ = ^ Yli=i ^xi- 
To estimate the location parameter 9 of the underlying distribution F, we 
consider the statistics T„ = T{xi, . . . ,Xn) = T{Fn). We can think of these 
as a sequence of statistics {T'„;n > 1}, one for each possible sample size 
n. Asymptotically, we consider estimators which are functionals. That is, 
liuin^ao Tn{x I, ... ,Xn) = T{F), whcrc the convergence is in probability, and 
we say that T{F) is the asymptotic value of {T„; n > 1} at F. The influence 
function IF of the estimator T at the distribution F is given by 

IF(r, T, F) = lim - e)F + .A.) - r(F)^ 

If we replace F by -Fn-i ~ and put e = ^, we can see that IF measures 
approximately n times the change in T caused by an additional observation at 
X when T is applied to a large sample of size n — 1. As discussed in Hampel 
et al. [35], the importance of the influence function lies in the fact that it 
can describe the effect of an infinitesimal contamination at the point x on the 
estimate T, standardized by the mass of the contaminant. It gives us a picture 
of the asymptotic bias caused by contamination in the data. 

The influence function has a number of very appealing properties. Ham- 
pel [33] shows that under conditions usually met in practice, Tn is asymptot- 
ically normal with asymptotic variance equal to the integral of IF^ over F; 
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Huber [40] also discusses these regularity conditions. That is, the distribution 
of {^Jn[T,^-T{F)]) tends to N{Q,V{T,F)), where V{T,F) is the asymptotic 
variance and is equal to 



This formula shows that a bounded influence function implies a bounded 
asymptotic variance, and cautions against using estimators with unbounded 
influence functions (for example, the sample mean). 

Another important property of the influence function is the supremum of the 
absolute value, from which Hampel et al. [35] define the gross- error sensitivity 
Y of T at F as 



where the supremum is over all x where IF{x;T, F) exists. The gross-error 
sensitivity measures the worst influence which a small amount of contamination 
can have on the value of an estimator, and can be regarded as an upper bound 
on the asymptotic bias of the estimator T. Ideally, 7* (T, F) is finite. In many 
cases, putting a bound on 7*(T, F) is the first step in making T robust. 

The influence function can also tell us more about T through the local-shift 
sensitivity X* . Slight changes in some observations, perhaps due to rounding 
or grouping, could have a measurable effect on the estimate. The effect of 
shifting an observation slightly from the point a; to a neighboring point y can 
be measured by means of IF{y; T, F) — IF{x; T, F), because an observation is 




7*(r,F)=sup|7F(x;T,F)|, 



X 
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added at y and removed at x. This can be approximated by the slope of the 
influence function. A* is the worst of these normalized differences: 

\IF{y;T,F)-IF{x;T,F)\ 
A = sup i i . 

Xy^y \y X\ 

Recall that the earliest treatment of outliers was to simply reject them 
entirely, as in Legendre [48] and Edgeworth [20]. In terms of the influence 
function, this means that the IF vanishes completely in a certain region (usually 
beyond a certain distance from the location estimate). In other words, if the IF 
is zero in some region, then all the observations in that region have no effect on 
T. For a symmetric F (and putting the center of symmetry at zero), Hampel et 
al. [35] define the rejection point p* as 

p* = inf {r > 0; IF{x; T, F) = 0 when \x\ > r}. 

If no such r exists, then p* = oo by definition. All observations farther away 
than p* are rejected completely. 

Figure 2.1 shows a scaled version of the influence function, the V'- function, 
for the sample mean and median. For T equal to the sample mean, it can be 
shown that IF(x; T, F) — x, the asymptotic efficiency e — 1 (the reciprocal of 
the variance) — the maximum possible efficiency, and the local-shift sensitivity 
A* = 1, also the lowest possible value for meaningful estimators. However, both 
the gross-error sensitivity 7* = oo and the rejection point p* = 00, which are 
very serious short-comings in the presence of outliers. We deduce that the mean 
is least susceptible to rounding errors and is very efficient, but can break down 
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completely if even just a single far-away outlier is present. 

The influence of the sample median has a discontinuity at the origin due to 
the fact that it changes discontinuously if one of the two center values is changed, 
and is horizontal (but non-zero) everywhere else. The local-shift sensitivity is 
A* = oo because of this discontinuity, while at F = A'"(0, 1) the median has 
efficiency e = ^ = 0.637 and for F = N^/i, cr^) an asymptotic efficiency of 
e = Hampel et al. [35] show that the gross-error sensitivity of the median 
is 7* = 1.253, which is in fact the minimum value at F = A'"(0, 1). Regardless 
of the median's other shortcomings, no estimator can beat the median when 
it comes to robustness. Finally, the rejection point p* — oo, which may seem 
surprising at first glance, but the median does not reject outliers, it only lessens 
their influence, to a considerable degree. It counts rather than measures all 
observations except for the middle one or two, respectively, for an uneven or 
even sample size. 

Based on the above comparison, we can easily conclude that the median is 
much more robust than the mean, at the cost of efficiency and local-shift sensitiv- 
ity. There is still room for improvement over the median, however, particularly 
regarding its efficiency e = ^ = 0.637. Using the JF-based criteria mentioned 
thus far, we explicitly state four goals that we would like our estimator to attain: 

• Possess a low gross-error sensitivity 7*. 

• Possess a finite rejection point p*. 

• Possess efficiency e close to 1. 
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• Possess a low local-shift sensitivity A*. 

Having developed these criteria wherewith we can evaluate the performance 
of various estimators, we now examine several candidates. We start with a class 
of estimators which although not too successful themselves, paved the way for 
other, modified and improved, estimators. 

2.2 M-estimators 

2.2.1 Monotone M-estimators 

The maximum likelihood estimator (MLE) is defined as the value of T„ = 
T{xi, . . . , Xn) that maximizes n"^^/(xj, T„), where / is the estimated density of 
F and is implicitly dependent on the estimated parameter T„. In practice, we 
usually calculate the MLE as the value of T„ such that 

n 

Tn = argnun y^[- In /(xj, T)]. 
1=1 

Huber [39] generalized this minimization problem to 

n 

Tn = arg nun ^ p{xi, T) . 

i=l 

Suppose that p has derivative ■j/', -^(a;, 6) = -§qP{x, 0). Then the estimate T„ will 
also satisfy 

n 

^ll){Xi,Tn) = 0, 
1=1 
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although a r„ can exist as a solution to the latter equation without solving the 
minimization problem. An estimator T„ defined by either of these two equations 
is called an M-estimator. Ruber proposed the name M-estimator because these 
are "generalized maximum likelihood" estimators. Of these two equations, the 
■0-function is more useful in practice. In fact, Hampel et al. [35] show an 
important connection between the IF and the V'-function: 



That is, the ■^-f unction is simply a scalar multiple of the IF. Within this 
framework, we present the p- and ^'-functions of the sample mean and median 
in Figure 2.1. For the sample mean, p[x) = x'^ and ijj{x) = x (ignoring the 
constant 2 as is customary, since it doesn't affect the minimization problem), 
while for the sample median, p{x) — \x\ and p{x) — sgn(x). 

Recall that the sample median has good robustness properties, but suffers 
from low efficiency and infinite local-shift sensitivity A*, while the sample mean 
has the highest possible efficiency at F = A'"(0, 1) and the lowest possible A*. 
Huber [39] combined these properties to produce what he called the Huber esti- 
mator: 



for 0 < 6 < oo. Huber 's estimator has a breakdown point of 50%, just hke the 
sample median, but has much better local-shift sensitivity. For — 6 < a; < 6 it 
also has the same optimal efficiency as the sample mean. The rejection point 



IF{x;^,F) 



ij{x,T{F)) 



JU^{y,e)Up)dF{yy 
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psi-function for mean 
-2-1012 




psi-function for median 
-1.0 -0.5 0.0 0.5 1.0 



rlio-f unction for mean 
0 12 3 4 




rlio-function for median 
0.0 0.5 1.0 1.5 2.0 



p*, however, is still unbounded because for |x| > &, the IF resembles that of the 
sample median. We present a graphical version in Figure 2.2, together with 3 
other robust estimators. 

It would be beneficial to take note of Winsor's principle, quoted by Tukey 
[81, p. 457], which states that "all distributions are approximately normal in 
the middle" . Applied to M-estimators, this calls for the middle portion of their 
■0-functions to resemble the function that is best/most efficient for Gaussian 
data, namely the sample mean. Thus, ^'-functions with optimal efficiency will 
be approximately linear near the origin: ^(m) ~ u. At this point, we can refine 
some of the properties that we would like our idealised estimator to possess: 

• We would like the IF to be linear in a vicinity of zero — this will provide 
good efficiency (close to e = 1) as well as small local-shift sensitivity A* for 
those observations near the center (assuming a symmetric distribution). 

• We would like the IF to be bounded. 

• We would like the IF to return to zero (and remain at zero) for some dis- 
tance sufficiently far away from the origin — resulting in a finite rejection 
point p*. 

The class of M-estimators that satisfies the latter objective are accordingly 
called redescending M-estimators. 

2.2.2 Redescending M-estimators 

Redescending M-estimators are those M-estimators that are able to reject 

extreme outliers completely. That is, there exists an r, 0 < r < oo, such that 
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Andrew's sine function 




Tul<ey's biweiglit function 



Huber's estimator 




ip{x) — 0 V > r. 

One of the earliest proposals for a redescending A/-cstimator was Hampel's 
three-part redescending M-estimator, first introduced by F.R. Hampel in the 
Princeton Robustness Study [3] , an extensive theoretical and Monte Carlo study 
of different robust estimators published in 1972. Hampel's estimator is defined 

by 

i^afiA^) = ^ 0 < |x| < a 

= a sign(a;) a < \x\ < b 

— ct7zf^sign(x) 6 < |x| < r 

= 0 kl ^ 

where 0<a<6<r<oo. This estimator demonstrated good performance in 
the Princeton Robustness Study, and is shown in Figure 2.2. 

The lack of differentiability of ipa,b,r is not ideal, however, and a smooth 
V'-function would be preferred. This led to the development of Andrews' sine 
function, also introduced in the Princeton Robustness Study [3] and defined by 




where I{x) is the indicator function. We graph the V'-function in Figure 2.2. 

In 1974 Beaton and Tukey [7] proposed the biweight function (short for 
6isquare weight), a smooth function that has been successfully used in a wide 
variety of applications, and depicted in Figure 2.2. In the context of ■^-f unctions, 
it can be defined as 
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The term bisquare refers to the square of the second term, which ensures 
continuity for both ip{x) and ip'{x) (useful during certain optimization routines). 
The biweight has been used in as diverse apphcations as biostatistics (quantal 
bioassay — Miller and Halpern [56]), materials science (regression of rubber 
abrasion loss — Cleveland [14]) and economics (computation of Nigerian cur- 
rency exchange rate — Okafor [58]). It has also encouraged further research 
regarding various properties of the ■0-function and accordingly, forms the ba- 
sis of several successful algorithms including the biflat estimate introduced by 
Rocke [64]. 

2.2.3 Computational Issues Regarding M-Estimators 

Computing an M-estimator involves a numerical optimization problem 
which frequently does not have a unique solution. Upon re-examination of the 
defining equation, 

II 

i=l 

it is clear that for redescending M-estimators, the LHS will vanish for all \x\ suf- 
ficiently large. It might be possible to find the global minimum of Y17=i Pi^ii ^n) 
instead, but this is not always computationally feasible. Rocke and Woodruff 
[84] develop a compound estimator, formally introduced on page 31, which con- 
sists of two stages: the first utilizes a combinatorial estimator such as MVE 
or MCD, defined on page 28, to locate a good starting point, which can then 
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be used for the minimization stage. Hampel et al. [35] propose to either (a) 
select the solution which is nearest to the sample median, or (b) use Newton's 
method starting with the sample median. Our experience with the biweight has 
been that starting from the sample median, convergence to the desired value is 
successfully and quickly achieved. 

M-estimators are not invariant with respect to scale, as defined up to this 
point. Fortunately, scale invariance can easily be achieved by defining T„ as the 
solution to: 



where Sn is a robust estimate of scale determined beforehand. The median 
absolute deviation (MAD) has the maximal breakdown 50% and is recommended 
for use in M-estimators by Hampel et al. [35, p. 105], and by Huber [40, p. 
107], who called the MAD "the single most useful ancillary estimate of scale". 
Because the expected value of the MAD is 0.6745cr for a Gaussian distribution, 
a robust scale estimate based on the MAD is usually defined as 

Sn = 1.483 • MAD{x} = 1.483 • medianj{|xi - median j{xj}\}. 

See for example, Mosteller and Tukey [57, p. 205]. Various software packages 
scale the MAD differently; the Splus routine mad defines the MAD as above. 

2.3 ^'-Estimators 

A class of estimators related to M-estimators are the -S-estimators, intro- 
duced by Rousseeuw and Yohai [72] and so called because they are based upon 
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minimization of a scale statistic. An important distinction between S- and M- 
estimators is that ^'-estimators provide simultaneous estimates of both location 
and scale, while M-estimators produce estimates for only one of these quantities. 
For a multivariate data set (xj, . . . ,x„), the S'-estimate of location and scale is 
defined as the vector T and matrix C e PDS(p) that jointly solve 

min I C I 

subject to 

^Ep(V(x.-TrC-nx,-T))=6o 

1=1 

for 60 a tuning constant. The constraint equation is often compactly written as 

1 " 

-y^pidi) = bo, 
1=1 

where di is a robust version of the Mahalanobis distance, p is usually chosen to 
be a scaled version of some base function, such as the biweight. Lopuhaa [50] 
showed that the breakdown point is given by the ratio of 60 to the maximum of 
p. Rocke [64] showed that ^'-estimators in high dimension can be very sensitive 
to outliers, even with high breakdown point, illustrating yet again the difficulty 
of obtaining a robust estimate of multivariate scale. 

Multivariate ^'-estimators pose a significant computational challenge. The 
^'-estimator is defined as the location estimate T and PDS matrix C that achieve 
the unique global minimum; it can be very difficult to guarantee finding a global 
minimum. In practical settings, the computed ^'-estimator may differ greatly 
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from the theoretical S'-estimator. This is illustrated with the "bushfire" data 
set. Originally published in a CSIRO report [12] and reprinted in Maronna and 
Yohai [53], this data set helped to locate bushfire scars in Australia by satellite 
measurements of n = 38 pixels in p = 5 frequency bands. Using the same 
theoretical ^'-estimator, Becker and Gather [8] labelled different observations 
as outliers than Maronna and Yohai [53]. Becker and Gather [8] ascribe this 
to the fact that different computational algorithms were used to minimize the 
objective function, and that due to a different starting point, Maronna and 
Yohai's algorithm may have converged to a local minimum instead of the global 
minimum. In the words of Rocke and Woodruff [84], "the algorithm is the 
estimator" . 

For difficult minimization problems like S'-estimation, a good starting point 
is of the utmost importance; it is worth investing considerable time and effort 
in pursuit of such. Rocke and Woodruff [84] proposed a procedure called a com- 
pound estimator, which obtains initial estimates of location and scale through 
one type of robust estimator, which are then used as starting points for the more 
powerful yS-estimator. The estimator that Rocke and Woodruff use to obtain 
their initial estimates is a combinatorial estimator, described in the following 
section. 

2.4 Combinatorial Estimators 

In 1985 Rousseeuw [69] introduced the Minimum Volume EUipsoid (MVE) 

and Minimum Covariance Determinant (MOD). These are both affine equivari- 

ant estimators with high breakdown points of ^ approaching | for large 
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n. The location estimate provided by the MVE is the center of the eUipsoid 
of minimal volume containing approximately half of the points in X, while the 
covariance estimate is provided by the shape matrix of this ellipse. Lopuhaa and 
Rousseeuw [50] add a small modification, which results in a slightly higher break- 
down point of Jillz^illM ^ Qj^^ which calculates the location estimate T e 3?^ 
and covariance estimate C G PDS(p) according to 



min |C| 

subject to 

#{z:(x,-T)^C-^(x,-T)<c2}> 

where # denotes cardinality. For Gaussian data, Lopuhaa and Rousseeuw [50] 
show that a good value for is Xo.5,p' ^he value for which /Lt)^S~^(X— 
m) < c^} = \. Rousseeuw and Leroy [70, p. 259] give an algorithm for the MVE. 

The MCD is similar to the MVE and has the same objective function. The 
difference is that in the constraint, the MCD requires only that the estimate be 
formed using half the points, instead of the ellipse containing half the points. It 
follows that both the MVE and MCD are too cumbersome to compute precisely, 
except in small data sets. Several different algorithms have been proposed, 
but of necessity they can be only approximations. These are usually based on 
various optimization techniques applied to smaller subsamples of the data, and 
as such can never guarantee the desired result. Standard optimization techniques 
cannot be applied in the usual sense because the criteria are nonconvex and 



n + p + 1 
2 
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nondifferentiable. There has been some innovative research done in this regard, 
with optimization methods ranging from Glover's tabu search [26], [27], apphed 
to the MOD by Rocke and Woodruff [84], to a steepest descent algorithm with 
random restarts proposed by Hawkins [36]. This is another case where "the 
algorithm is the estimator" . 

It later became apparent that the MCD was vastly superior. Rockc and 
Woodruff [84] found that with p — 25 and 15% contamination, the MCD took 32 
seconds to produce good enough starting values for the ^'-estimator to converge, 
whereas after 8,192 seconds the MVE had still not converged. For the 20 x 5 
"wood specific gravity" data set [70, p. 243], Cook and Hawkins [15] had to 
examine 57,000 cases (eUipses) before finding the MVE. Since this was still 
not guaranteed to be the correct answer, they allowed the algorithm to run to 
completion, i.e., full enumeration, which took 50 minutes on an IBM PC/AT. 
Even on this outdated computer, for a mere 20 x 5 data set we could certainly 
wish for more. 

It appears that an inherent weakness of both the MVE and MCD is that the 
computation time grows too fast to be useful in moderate and large data sets. 
Hawkins [36] proposed an efficient version of the MVE where the search time 
grows with pn. This leads to what Rocke and Woodruff [84] call the "perverse 
situation" , in which the analysis can actually be made worse by the collection 
of more data — as the size of the data set grows, so too does deterioration in 
the algorithm's performance. 
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2.5 Compound Estimators 

In 1994, Rocke and Woodruff [84] proposed a new class of estimators called 
compound estimators, which utilize the complementary strengths of combina- 
torial and ^-estimators. To remove the paradox of additional data degrading 
the estimator's performance, they randomly partition the data into L:^J cells, 
where 7(p) is the minimum number of observations required in each cell. They 
then apply the MCD algorithm to each cell, constrained by a predetermined 
time hmit. If the researcher can afford to use t seconds for this stage of the 
algorithm, the MCD algorithm is only allowed to spend n seconds searching 
each cell. This will result in L:^J different location and scale estimates; the best 
of these is chosen as the starting point for the ^'-estimator algorithm, calculated 
on the unpartitioned data set. The final solution is given by the ^'-estimator; 
that is, the location and scale estimates producing the smallest determinant of 
the covariance matrix, |C|. 

Successful execution of this method requires coming fairly close to the de- 
sired location and shape in the first stage (MCD applied to each data cell), which 
would lead the ^'-estimation algorithm to converge to the correct root. More 
data would help, not hinder, this estimator, as there is an increased chance that 
one of the initial L:^J cells would contain fewer outliers, provide a good starting 
point and aid convergence to the desired minimum. (Only one such "lucky" cell 
is required.) Rocke and Woodruff distribute C code for this algorithm, which 
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they call MULTOUT. 



2.6 BACON 

Hadi et al. [32] describe a fast, robust method called BACON — Blocked 
Adaptive Computationally-efficient Outlier Nominator. BACON is an iterative 
method, the core of which consists of identifying a basic subset presumed to 
be outlier free, from which the sample mean and covariance matrix can be 
estimated, and followed by robust Mahalanobis distances. 

To construct the first basic subset, Hadi et al. [32] provide two options. The 
first simply computes the sample mean and covariance of all n observations, fol- 
lowed by the Mahalanobis distances (or discrepancies di, as Hadi, et al. [32], 
call them). Those observations with the cp smallest diS form the initial basic 
subset, where c = 5 is typical and not very influential in the algorithm's conver- 
gence. This procedure is affine equivariant but not at all robust; our experience 
has been that in data sets containing a substantial amount of outliers, the algo- 
rithm falsely converges and fails to detect them. The second option computes 
the coordinate- wise median m, and thereafter ||xj — m|| for all n observations. 
Those cp observations with the least distance to m form the first basic subset. 
This start is robust, but not affine equivariant. Subsequent iterations are affine 
equivariant, however, so the overall algorithm is nearly affine equivariant. We 
hst the steps required for BACON as follows: 

1. Select an initial basic subset according to either of the two options, and 
calculate the sample mean and covariance C(, of this subset. 

2. Compute the discrepancies di, according to 
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di{T,,Cb) = ^(x,-T6)^C,-^(x,-T6). 

3. Set the new basic subset to contain all points with discrepancy di less 
than Cn„r , /y^ a, , where c„„r — + c/ir is a correction factor; h — 

V (^-n)'P 

l^ n+p+i j ^ ^ -g ^j^g g-^g ^£ ^j^g current basic subset, Chr = max {O, j^}, and 
c„„ = 1 + H r — = 1 + H T"^, assuming n + p + 1 is even. 

"P n—p n—h—p n—p n— 1— 3p' * ' ' 

4. Repeat steps 2 and 3 until the size of the basic subset remains constant. 

5. Nominate the observations excluded by the final basic subset as outliers. 

Hadi et al. [32] add the correction factor in step 3 because when r <^ /i ~ |, 
the elements of the covariance matrix tend to be smaller than they should be. 
This method is a faster version of a "forward search" algorithm proposed by 
Hadi in [30] and modified in [31]. These methods are similar to BACON but 
in step 3, add only the single point with the least di provided it is less than a 
certain threshold value). Several variations of forward search algorithms have 
been examined in the literature e.g. in [37] and [6], but they usually converge 
very slowly for large n. Similarly, backward search algorithms (also discussed 
in [37] and [6]) have been proposed which start with all n observations, and 
at each step discard the single observation with the furthest distance, provided 
it is above a certain minimum threshold. BACON is a significant computa- 
tional improvement because it can move large blocks of observations in a single 
step. Our study of BACON indicates that it is indeed computationally fast, and 
demonstrates very good success amongst normally distributed data, but does 
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not perform as well on non-normal data. We will analyze its performance in 
greater detail in chapter 4. 

2.7 Non-Distance Bcised Methods 

All of the methods mentioned so far base their outlier identification proce- 
dures on a robust version of the Mahalanobis distance. While this is certainly the 
most popular approach, it is not the only one. In 1974 Friedman and Tukey [25] 
introduced a novel procedure which they called projection pursuit. Their pur- 
pose was to find interesting structure in the data or unexpected characteristics 
that might not be readily apparent. In addition to several other applications not 
originally envisioned by Friedman and Tukey including even multiple regression 
[24], projection pursuit also lends itself very well to outlier identification. 

For data sets following an elliptically symmetric distribution like the multi- 
variate normal, the usual summary statistics such as sample mean, covariance, 
etc. provide enough information to generally understand it. With other data 
sets, however, important information or structures are frequently not adequately 
captured by these statistics. The goal is thus to pursue projections in which the 
data reveal interesting structures when viewed by the researcher in low dimen- 
sion, relying on what Friedman and Tukey [25] call "the human gift for pattern 
recognition" . The projection pursuit algorithm assigns a numerical index, called 
the projection index, to certain one- or two-dimensional projections of the data 
characterizing the amount of structure present in that projection, as can be 
calculated through the data density variation. This projection index is then 
numerically maximized with respect to the parameters defining it to obtain the 
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"most interesting" projection. The projection index can, and should, according 
to Huber [41], be afhne equivariant. After a certain projection has been viewed, 
the particular structure that it revealed is removed, so that other interesting 
projections can be displayed, similar to principal components analysis, which is 
actually a special case of projection pursuit. 

In a comprehensive invited article, Huber [41] states that projection pursuit 
methods are able to ignore noisy and information-poor representations. This 
gives them a distinct advantage over methods based on inter-point distances 
such as multidimensional scaling and clustering techniques, which have both 
been used for outher identification — see [78] , [59] and [67] . Friedman and Tukey 
[25] originally decided to define data as uninteresting if it appeared to have a 
normal distribution; Huber [41] gave heuristic arguments to support this idea. 
To highlight interesting features of the data, they therefore define the projection 
index to measure, and subsequently maximize, departure from normality. Due to 
a renewal of interest in projection pursuit in the 1980 's, Friedman [23] updated 
their original algorithm. According to Huber [41], projection pursuit methods 
have primarily one serious drawback — their high demand on computational 
time. This will subsequently be shown to be quite severe indeed. 

2.7.1 The Stahel-Donoho Estimator 

The most well-known outlier identification method based upon the projec- 
tion pursuit concept, is the Stahel-Donoho estimator, an affine equivariant esti- 
mator with a breakdown point approaching 50%. This estimator was developed 
by Stahel and Donoho in [79] and [19] , and became better known when Maronna 
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and Yohai [53] published an analysis of it in 1995. 

The Stahel-Donoho estimator is defined as a weighted mean and a weighted 
covariance matrix, where each observation's weight is inversely proportional to 
how "outlying" it is. This outlyingness measure is based upon the projection 
pursuit idea that if a point is a multivariate outlier, there must be some one- 
dimensional projection of the data in which this point is a univariate outlier. 
Using a particular observation as a reference point, the Stahel-Donoho algorithm 
determines which directions have optimal values for a pair of robust, univari- 
ate location/scale estimators, then uses these estimators to assign weights to 
the other points. Employing the median and the MAD (for a breakdown point 
approaching 50%), necessitates solving a global optimization problem with dis- 
continuous derivatives, for each observation. Selecting smooth estimators does 
not help very much because this results in the objective function possessing 
several local maxima, so that gradient search methods are rendered unsuitable. 
Maronna and Yohai [53] infer that the reason not much attention has been paid 
to the Stahel-Donoho estimator is due to other "more tractable" estimators such 
as the MVE and ^'-estimators. Maronna and Yohai [53] present an algorithm 
for this estimator, which works very well for p = 2, and state further that no 
known feasible method exists for large p. A simulation study by Juan and Prieto 
[44] with n = 100 and p = 5, 10, 20 comparing the MCD, MULTOUT, and the 
Stahel-Donoho estimator, showed the latter outperforming the other two esti- 
mators, although they all approached 100% failure rate at contamination level 
e = 10% for p = 20, at e = 15%/orp > 10, and at e = 20%/orp > 5. We 
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conclude that this method is not viable for large data sets, but it has given rise 
to at least one other successful method of outlier identification. 

2.7.2 Kurtosis 

One way of reducing the computational cost of the Stahel-Donoho estima- 
tor is to reduce the number of projections that need to be examined. Peiia and 
Prieto [60] propose a method which they call Kurtosisl. Kurtosisl involves pro- 
jecting the data onto a set of 2p directions, where these directions are chosen to 
maximize and minimize the kurtosis coefficient of the data along them. Kurtosis 
is a measure of how peaked or flat the distribution is. Data sets with high kur- 
tosis tend to have a sharp peak near the mean, decline rather rapidly, and have 
heavy tails, while data sets with low kurtosis tend to have a flattened peak near 
the mean; the uniform distribution is an extreme example. The kurtosis coeffi- 
cient is also inversely related to bimodality. A small number of outliers would 
thus cause heavy tails and a larger kurtosis coefficient, while a large number 
of outliers would start introducing bimodality and the kurtosis would decrease. 
Viewing the data along those projections that have the maximum and minimum 
kurtosis values would display the outliers in a more recognizable representation. 

For large data sets, the Stahel-Donoho algorithm requires projecting the 
data onto randomly generated projections because full enumeration is infeasible, 
and requires investigation along a very large number of projections in order to 
be successful. By drastically reducing these directions, Kurtosisl aims to signif- 
icantly improve computational speed without sacrificing accuracy. Successfully 
solving the optimization routine to find the maximum or minimum kurtosis val- 
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ues necessitates reaching the global maximum or minimum. Since this is not 
efficient, we have to settle for a set of p local maximizers and p local minimizers. 
Pefia and Prieto [60] show how computing a local maximizer/minimizer would 
correspond to finding either (a) the direction from the center of the data straight 
to the outliers, which is exactly what we are looking for, or else (b) a direction 
orthogonal to it. They then project the data onto a subspace orthogonal to 
the computed directions and rerun the optimization routine. This process gets 
repeated p times, so in total 2p directions are examined, and we expect with 
high probability that the outliers lie along at least one of these directions. 

For each of these 2p directions, Pena and Prieto [60] use a backward search 
algorithm based on the univariate median and MAD to find potential outliers. 
Based upon the sample mean and covariance of all points not labeled as po- 
tential outliers, robust Mahalanobis distances MD* are calculated for all the 
observations. Those points with MD*^ < Xo.99,p not considered outhers and 
are put back in the set of inliers. This process is repeated until convergence 
(usually only a few iterations.) 

Our study of this method shows that it does indeed do a good job of detecting 
outliers, for a wide variety of outlier types and data situations, although it 
tends to produce a relatively large number of false positives. An interesting 
feature is that its success rate remains approximately the same for outliers both 
near and far, whereas distance-based methods demonstrate a marked increase 
in performance for distant outliers. Although it is a significant improvement 
over the Stahel-Donoho estimator, computational speed still remains a serious 
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weakness, especially with regard to p. 

2.8 Types of Outliers 

Certain types of outliers are easier to find than others. Some algorithms 
generally succeed at detecting a certain type of outlier, but fail miserably with 
another type of outlier. Clearly there are an infinite number of outlier types, but 
some are more common than others. It is common practice when generating data 
for the purpose of testing outlier algorithms, to assume the primary distribution 
is an uncorrelated multivariate normal, and that the outliers come from another 
multivariate normal with mean /j, and covariance matrix S. If the estimator is 
afRne equivariant, we can assume without loss of generality that the primary 
distribution has mean 0 and covariance I. Rocke and Woodruff [65] mention a 
useful framework to consider multivariate normal outlier types when the primary 
distribution is A^p(0,I): 

• Shift Outliers: The outliers have distribution Np{fjL,I). 

• Point Outliers: These outliers form a densely concentrated cluster, tending 
towards a point mass. 

• Symmetric Linear Outliers: Given a unit vector u, the outliers have dis- 
tribution uA''p(0, S), for arbitrary S. 

• Asymmetric Linear Outliers: Given a unit vector u, and its coordinate- 
wise absolute value \u\, the outliers have distribution |u|A^p(0, S). 

• Radial Outliers: The outliers have distribution A^(0,A;I), where k > 1. 
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Linear outliers are a commonly used outlier type. Radial outliers are too 
easy to reveal possible differences between outlier algorithms (all of the outhers 
will usually be found by all of the algorithms). Rocke and Woodruff [66] show 
that for distance-based methods, point outliers and shift outliers are the most 
difficult to detect. For shift outhers, the expectation of the Mahalanobis distance 
between the outlier and primary distribution means is least when the covariance 
matrices are similar. 

Point outhers are also difficult for most distance-based methods, and these 
outliers are perhaps best detected with algorithms specifically designed to detect 
such densely concentrated outlier clusters. Standard clustering techniques as 
described in, for example, Johnson and Wichern [43, ch. 12], do not appear to 
perform very well in detecting clusters of this type. In order to detect densely 
concentrated clusters, several authors have developed special techniques for this 
purpose, such as Peiia and Prieto [59] and Rocke and Woodruff [67]. 

Wc briefly comment on the near exclusive use of the multivariate normal 
distribution when generating sample data sets for outlier algorithms. Our study 
shows that an algorithm's performance can change drastically when switching 
to a different type of distribution. For example, BACON demonstrates virtually 
perfect performance in a variety of normally distributed outlier types, but when 
analyzing data from a mixture of ^ distributions, shows near 100% failure rate 
in many situations. There is no guarantee that observed data follows a normal 
distribution in practice, and we feel that in general, more testing of algorithms 
on non-normal data should be conducted. 
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We thus conclude our tour of past and current outlier methods and in the 
following chapter, propose a new algorithm for outlier detection. 
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3. The Development of a New Method 

3.1 PreUminciry Investigation 

As a preliminary step to developing a new outlier detection method, we 
briefly examined and compared existing methods to determine possible areas of 
improvement. We selected four algorithms that appeared to have good poten- 
tial for finding outliers: MCD (using an improved version called fastMCD [71] 
implemented as cov.mcd in Splus), BACON (a DOS executable version of the 
code was obtained from the author, Prof. A.S. Hadi), MULTOUT — the hybrid 
algorithm of Rocke and Woodruff (a C version of the code was obtained from the 
author's website) and Kurtosisl (a MATLAB version of the code was obtained 
from the authors). In accordance with most other authors, such as Rocke and 
Woodruff [66], we use a multivariate version of Tukey's contaminated Normal 
model [81], (1 - e)iV(0,I) + eN{6e,\I), where e = (1,...,!)^, and S = 10. 
We considered p = 5, 10, 20 and e = 10%, 20%, 30%, 40%. The values of A we 
consider are A = 0.1 and A = 1.0 which correspond to a concentrated cluster of 
outliers (point outliers) and shift outliers, respectively. Radial outliers, A = 5.0, 
are not very interesting or informative since all the methods yielded 2% error 
or less. None of the algorithms except cov.mcd were suitable for repeated sim- 
ulatons in the given format, so for each parameter combination we carried out 
10 simulations with n — 1,000 points, generated by the Splus random number 
generator. The exception was Kurtosisl, which due to MATLAB's restriction 
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on large data sets, was designed for only n — 750; these points were randomly 
drawn from the original set of 1,000 points. 

Frequently, results of this type are presented in a 2 x 2 table showing success 
and failure in identifying outliers and non-outliers, which we henceforth call in- 
liers. It is also possible to calculate various indices such as sensitivity, specificity 
and positive /negative predictive power, terms which are commonly used in the 
medical literature. Throughout this thesis, we present the percentage false neg- 
atives followed by the percentage false positives in the same cell; we henceforth 
refer to these respective values as the outlier error rate and inlier error rate. 
We propose these names because the outher error rate specifies the percentage 
errors recorded within the group of true outliers, and similarly for the inlier 
error rate. Note that these respective error rates correspond to the complement 
of the sensitivity and the complement of the specificity. This is more compact 
than a series of 2 x 2 tables and allows us to easily focus on bringing the error 
rates down to zero. Nevertheless, 2x2 tables are very descriptive and frequently 
used in existing literature, so we illustrate the connection between our notation 

Table 3.1: Sample 2x2 table illustrating the notation used in this thesis 





Predicted Outhers 


Predicted Inhers 


True Outliers 


a 


b 


True Inliers 


c 


d 



Outlier Error Rate = Inlier Error Rate = 
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and a standard 2x2 table in Table 3.1. 

Results from these simulations are shown in Table 3.2. The values 50.0/10.3 
in the upper left cell of this table mean that for this simulation, MCD failed 
to identify 50% of the outliers, and mistakenly identified 10.3% of the regular 
points [inhers] as outliers. This corresponds to sensitivity 50.0%, specificity 
89.7%, positive predictive value 67.6% and negative predictive value 80.7%. 

Computational time was not a big issue for these simulations due to the 
small data size. For BACON, we used the second option of obtaining an initial 
subset (coordinate-wise medians) because performance with the other option 
was markedly inferior. MULTOUT's C code took shghtly longer to execute 
than the Kurtosisl's MATLAB code (on the order of one minute), which might 
be a source of concern for larger data sets, but did not pose any difficulties for 
n = 1, 000. Examination of table 3.2 reveals that both MCD and MULTOUT 
are incapable of detecting concentrated outlier clusters. (This could be due to 
the algorithms using an initial value originating from a cell consisting solely of 
outliers, which would have a significantly smaller covariance than a cell consist- 
ing soley of inhers, followed by convergence to the wrong solution.) BACON 
demonstrates a perfect score, while Kurtosisl is usually quite accurate with a 
few exceptions. The exceptions for Kurtosisl occur when it missed all or nearly 
all of the outliers on one or two of the ten runs. For 5 — 10, however, near per- 
fect performance should be expected as there is virtually zero overlap between 
the inlier and outlier groups. 
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Table 3.2: Preliminary simulations with multivariate normal outliers 









MCD 


BACON 


mi;ltoi;t 


Kurtosisl 


p = 5 


e = 0.3 


A = 0.1 


50.0/10.3 


0/0 


0/0.7 


0/0.8 






A= 1 


0/2.7 


0/0 


0/0.8 


0/0.5 




e = 0.4 


A = 0.1 


100/35.6 


0/0 


100/16.9 


0/0.9 






A = 1 


0/2.5 


0/0 


0/0.8 


10.0/0.7 


p= 10 


e = 0.2 


A = 0.1 


50.0/7.0 


0/0 


0/1.0 


0/1.0 






A = 1 


0/2.7 


0/0 


0/0.8 


0/1.0 




e = 0.3 


A = 0.1 


100/22.7 


0/0 


70.0/5.8 


0/0.9 






A = 1 


0/2.8 


0/0 


0/1.0 


0/1.0 




e = 0.4 


A = 0.1 


100/49.1 


0/0 


100/22.1 


0/0.8 






A= 1 


0/2.5 


0/0 


0/0.7 


0/0.8 


p = 20 


e = 0.1 


A = 0.1 


0/3.6 


0/0 


20.0/1.2 


0/1.0 






A= 1 


0/3.6 


0/0 


0/0.9 


0/1.1 




e = 0.2 


A = 0.1 


100/17.6 


0/0 


100/5.3 


0/1.0 






A = 1 


0/3.0 


0/0 


0/0.9 


18.5/1.2 




e = 0.3 


A = 0.1 


100/38.2 


0/0 


100/12.9 


11.0/5.6 






A= 1 


76.0/2.9 


0/0 


29.5/1.1 


0/1.2 




e = 0.4 


A = 0.1 


100/76.0 


0/0 


100/34.5 


0/1.0 






A = 1 


96.0/3.5 


0/0 


99.1/1.6 


0/1.0 
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The data in Table 3.2 were generated from multivariate Normal distribu- 
tions. An interesting question not usually addressed in the literature is how 
these methods would perform with non-Normal data. We therefore carried out 
a similar experiment, but this time we generated the data from a mixture of un- 
correlated distributions — the primary distribution (describing the inliers) 
is xh with outliers drawn from x^g. These results can be seen in Table 3.3, also 
based on the mean over 10 iterations. 

Several differences are apparent from the multivariate Normal data. The 
most noticeable is BACON's significant inability to detect outliers for p = 5 
and the higher contamination levels for p — 10. MCD continues to produce rela- 
tively high numbers of false positives, while MULTOUT and Kurtosisl produce 
better overall results. 

3.2 The Development of a New Method 

3.2.1 Obtaining Robust Estimates 

Since it was our purpose to develop an outlier detection method suitable for 
large data sets, we considered computational speed a very important criterion. 
We therefore chose not to pursue the projection pursuit type algorithms, such as 
Kurtosisl, but to focus on distance-based methods instead. A key ingredient of 
a successful distance-based method is a robust estimate of the covariance matrix. 
Despite being more sensitive to outliers (i.e. more difficult to estimate robustly) 
than the location estimate, the Mahalanobis distance is also extremely sensitive 
to an incorrect scale estimate and could be rendered incapable of separating the 
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Table 3.3: Preliminary simulations with outliers 











iV± \J ljL yj VJ L 


xvUl lUolol 


p=5 


e = 5% 


0/12.2 


4o.o/0.01 


0/6.2 


A o In A 

0.2/6.9 




e = 10% 


O/lzA 


91.4/0.03 


0/6.56 


0.2/6.65 




e = 20% 


0/10.3 


97.0/0.01 


A / r r» 

0/5.2 


r\ c\ //*» ^ 

0.2/6.7 




e = 30% 


0/9.2 


99.3/0 


1.0/4.0 


5.5/5.6 




e = 40%) 


0/6.3 


99.5/0 


80.1/1.0 


17.2/3.6 




e = 5%) 


0/13.7 


0.8/0.2 


A /T a 

0/7.4 


A /T a 

0/7.9 




1 no/ 

e = 10/0 


0/12. 0 


0.4/0.2 


0/7.5 


0/7.6 




e = 20% 


0/11.6 


10.3/0.2 


0/5.7 


0/8.0 




e - 30% 


0/9.5 


39.3/0.2 


0/5.7 


0/7.3 




e = 40% 


0/7.9 


79.1/0.1 


0/4.9 


5.6 


p=20 


e = 5% 


0/15.4 


0/0.2 


0/8.0 


0/7.6 




e = 10% 


0/13.8 


0/0.3 


0/7.3 


0/8.3 




e = 20% 


0/12.4 


0/0.3 


0/6.3 


0/7.6 




e = 30% 


0/9.5 


0/0.4 


0/6.0 


0/6.5 




e = 40% 


0/9.3 


0/0.2 


0/5.0 


0/5.6 
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outliers and inliers. We therefore devoted considerable time to obtaining a fast, 
robust estimate of the covariance matrix. 

Of the four estimators shown in Figure 2.2, we prefer Andrews' sine function 
and Tukey's biweight function, because they are smoother than the other two 
estimators. The biweight has been successfully used in a variety of situations, so 
we also decided to use it. For p = 2, Mosteller and Tukey [57, ch. 10] calculate a 
highly robust covariance matrix based on biweight robust regression. To achieve 
this, however, they sacrifice component invariance, i.e. the results from calculat- 
ing var{x) followed by var{y) are different than first calculating var{y) followed 
by vaT{x), and hence cov{x,y) also differs from cov{y,x). Mosteller and Tukey 
[57] provide an extension for p = 3 and higher, but the computational burden 
increase dramatically because estimation of the off-diagonal elements involves 
multiple regression procedures. This procedure would not be feasible for high 
dimensions. We felt that the price of invariance and additional computational 
burden were too high for large data sets, so did not pursue this direction further. 

Analyzing each component separately, we subsequently investigated the 
weights assigned to observations by the biweight function. Let the univari- 
ate observations be Xi, . . . ,Xn with corresponding weights Wi, . . . , Xb^, the 
location estimate, which we initially take to be the median; S a scale estimate 
based on the median absolute deviation from x^w' S — median{|a:i — and 
c a tuning constant. The weights Wi are then iteratively calculated according to 
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I 0, otherwise, 

Xbw — —=-—, and S — medianUxi — Xbw\}. 

Mosteller and Tukey [57] recommend c = 6 or c = 9; Hoaglin et al. [38] 
examine the biweight's robustness properties, such as gross error sensitivity, and 
performance (efficiency) for these and other values. The bi weight has a higher 
efficiency at c = 9 because less data is rejected, but better (i.e., lower) gross 
error sensitivity at c = 6. Kafadar [45] also conducted a study investigating the 
biweight's efficiency for different values of c, sample sizes and distributions and 
found that its efficiency increases with both sample size and tuning constant c at 
the Gaussian. At other distributions such as One- Wild and Slash, its efficiency 
increases with sample size but decreases with c. Note that c—oo corresponds 
to the sample mean, because then all the observations have weight equal to one. 
For instance, at c = 6, Hoaghn et al. [38] show that the asymptotic variance of 
the biweight at the standard Gaussian is 1.094, so that its asymptotic efficiency 
relative to the mean (on Gaussian data) is e = 91.4%, and the gross error 
sensitivity is 7* = 1.68. For comparison, recall that the mean has 7* = 00 and 
for the median, 7* = 1.25. The asymptotic efficiency at c = 9 is e = 98.2%, 
with 7* — 2.05. Mosteller and Tukey [57, p. 206] note that the performance 
of an estimator with efficiency above 90% is indistinguishable in practice from 
one with an efficiency of 100%, so from a theoretical viewpoint, c = 6 seems 
preferable when dealing with outliers. Since the expected value of the MAD 
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equals 0.6745(7 at the Gaussian distribution, the rejection point with c — 6 
equals roughly Aa. We found that c — 6 works better than c — 9 because it 
limits potential outliers from being included in the initial location and scale 
estimates; it is very important that these estimates accurately describe only 
the inliers. We did not investigate c — 3; this would yield a rejection point of 
0.6745 • 3(7 ~ 2a, which would reject too many inliers and underestimate the 
scale. 

Another computational matter is convergence. Kafadar [45] found that three 
to six iterations were sufficient for convergence to 0.0005 of the biweight scale 
statistic. For our intended applications, namely large data sets, computational 
speed is very important, so we used 3 iterations. We do not store the value of the 
biweight location or scale, and will shortly discuss how we found better results 
when reassigning the biweight weights to either zero or one, so the only critical 
aspect is whether the weights are above or below the reassignment boundary. 

After fitting the biweight to each of the p components, we used the weights 
thus obtained to calculate a weighted mean and covariance matrix. For the 
covariance, we used the product of the weights for the different coordinates as 
follows: C(i, j) = ^^(^k ) -^^^j^ere the superscript indicates the 

dimension number {1 . . .p) and the subscript the observation number (1 . . . n). 
Investigation into the covariance matrices obtained in this manner revealed that 
they were frequently too small. This occurred because most of the inliers re- 
ceived weights in the range 0.90-0.99, so that their full contributions to the 
sample covariance were not counted. This was significant because points near 
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the edge of the primary distribution received reduced weights, yet their full 
weights were necessary for an accurate estimate. To correct this problem, we 
focused attention on producing an initial, smaller set of outlier-free data from 
which we could calculate an unweighted sample mean and covariance. This is 
a similar idea to BACON, but we achieve the result in a very different man- 
ner. To this end, we reassign a weight of 0 to all points with biweight-weight 
of less than 0.3, and a weight of 1 to those points with biweight-weight of 0.3 
and greater. This value of 0.3 was initially arbitrarily chosen; we selected a 
relatively small value so as not to discard too much data. We subsequently refer 
to the value where this separation is made as the cutoff value — determining a 
suitable cutoff value for a wide variety of data situations is an important part 
of this investigation. 

We illustrate the superiority of this modification with a tricky set of cor- 
related Poisson data, following a method described in [42, ch. 37]. Consider 
a data set with n = 10, 000 and p = 5. Generate the first 8,000 observations 
from a Poisson(lO) distribution, and the first component of the next 2,000 ob- 
servations from a Poisson(40) distribution. The remaining 4 components of the 
last 2,000 observations are again generated from a Poisson(lO) distribution. To 
obtain correlated data, we generate another 10, 000 x 1 vector Y from a Pois- 
son(lO) distribution, and add this vector to each component of the data matrix. 
Let Yi, . . . , Y5 denote the columns of the data matrix before addition of Y, and 
Xi,...,X5 the columns after addition. It is straightforward to compute the 
theoretical covariance matrix: cov{Xi, X2) — cov(yi -l-Y, Y2 + Y) — var(y) = 10 
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and var(Xi) = var(Yi + Y). For the set of inliers — the first 8,000 observa- 
tions — the variance of each column is 10 + 10 = 20, k — 1, . . . , 5. Using 
the original set of biweight weights, the following sample covariance matrix was 
obtained: 



15.29 


6.49 


6.47 


6.57 


6.68 


6.49 


13.35 


6.01 


6.28 


6.36 


6.47 


6.01 


12.49 


6.08 


6.24 


6.57 


6.29 


6.08 


12.97 


6.47 


6.68 


6.36 


6.24 


6.47 


13.42 



Using a cutoff value of 0.3, we obtained the following sample covariance matrix: 



20.19 


8.91 


8.99 


9.16 


9.38 


8.91 


19.31 


8.77 


9.21 


9.33 


8.99 


8.77 


18.35 


8.95 


9.10 


9.16 


9.21 


8.95 


19.14 


9.48 


9.38 


9.33 


9.10 


9.48 


19.45 



which is a considerable improvement and only slightly smaller than the theo- 
retical inlier covariance matrix: 20's on the diagonal and lO's everywhere else. 
Considering the fact that there is considerable overlap between the outliers and 
inliers, and also that the data are not Normally distributed, the latter covariance 
estimate is quite satisfactory. It leads to adequate separation between inlier and 
outlier Mahalanobis distances, so that for this particular data set and cutoff 
value, the outlier identification error is virtually zero. For comparison, we also 
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examine the results obtained by some of the other outher algorithms. For the 
same data set, BACON did not detect any outhers and both its initial and final 

covariance estimates were very similar, the final estimate being 



169.68 9.54 9.52 9.69 9.42 

9.54 20.33 9.82 10.20 10.23 

9.52 9.82 19.81 10.14 10.07 

9.69 10.20 10.14 20.28 10.46 

9.42 10.23 10.07 10.46 20.34 



3.2.2 Determing the Rejection Boundary 

An important issue is how to actually reject the outliers. Ideally, our robust 
estimates of location and scale will lead to robust Mahalanobis distances demon- 
strating a good separation between outliers and inliers. Johnson and Wichern 
[43, ch. 5] state that for multivariate normal data, the classical Mahalanobis 
distance based on the sample mean and covariance matrix is follows an F dis- 
tribution. This implies that an appropriate boundary value could be found by 
selecting a suitable F percentile. Our study showed that boundary values cal- 
culated in this manner were frequently suboptimal. BACON uses a modified 
Xp percentile as described on page 33, we found that the rejection values it ob- 
tained were often too high. Plotting the robust Mahalanobis distances would 
easily allow perfect classification through visual inspection, but the computed 
rejection point led to unnecessary misclassification errors. We therefore decided 
to design a new criterion that didn't require any assumptions and would work 
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for robust Mahalanobis distances computed from a wide variety of data types. 

Intuitively, if robust location and scale estimates have been computed that 
accurately describe the inliers and reveal the unexpectedly large displacement of 
the outhers, the robust Mahalanobis distances will separate into two categories 
— a group with smaller values (inliers) and a group with larger values (out- 
liers). For sufficient data, if we were to calculate the empirical density of these 
distances, we expect to see a peak describing the inliers, after which the density 
would return close to zero, and could either reveal a smaller second peak (if the 
outliers' Mahalanobis distances were sufficiently concentrated) or simply remain 
close to zero. In practice, we found that even with 40% outliers, the first peak 
was very well defined but the second peak was virtually nonexistent because in 
contrast to the inliers, the outliers' distances were widely dispersed. 

The value of Mahalanobis distance at which the density first approaches 
zero could therefore serve as a final rejection boundary. Any observations with 
Mahalanobis distance less than this value would be classified as inliers, and 
otherwise as outliers. This rejection point takes into account the unique charac- 
teristics of each data set and doesn't require any assumptions other than robust 
location/scale estimates that accurately describe the inliers. When the peak 
flattens out and the density returns to zero, the data (inliers) upon which the 
peak is calculated have vanished. We illustrate this concept in Figure 3.1. The 
robust Mahalanobis distances calculated by our method, using a cutoff of 0.3, 
are shown in the first graph, with the corresponding density in the second. (We 
attribute the sharp rise in density for robust Mahalanobis distance greater than 
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100 to be a result of numerical instabilities of the density algorithm at this end- 
point.) The data for these figures were generated with n — 10, 000 and p — 10; 
the first 6,000 points from an uncorrelated x\ cind the remaining 4,000 points 
from an uncorrelated x^^. Both graphs also show the rejection value, 18.975, 
yielding 2.45% outlier error and 1.08% inlier error. For comparison, BACON 
failed to detect nearly all of the outliers in the same data set (98.5% error among 
outliers, no error among inliers). When BACON was modified to compute a re- 
jection point based upon the density method thus described, its outlier error 
rate dropped to 77.4%. (That is, we inserted our density estimation routine into 
the BACON program and used this instead of BACON's modified xlp percentile 
to determine the final rejection boundary.) An even more impressive improve- 
ment occurred when the outliers were drawn from x^g, for 20% contamination 
in p = 5 dimensions. When computing the rejection point according to its usual 
modified percentiles, BACON produced outher and inlier error rates of 98.1% 
and 0.01%, but when modified to use the density method, these respective er- 
ror rates became 38.9% and 3.9%. (The fact that vastly different error rates 
were produced by our method and BACON, utilizing the same outlier rejection 
procedure, points to the fact that robust location/scale estimates are crucial 
to overall performance, leading to good separation between outlier and inlier 
Mahalanobis distances.) 

In practice, rather than requiring the estimated density to actually reach 
zero, it is better to determine the rejection point as the first time that the 
slope of this density enters a small interval containing zero. As the density 
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Figure 3.1: Robust Mahalanobis distances with estimated density and slope 
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decreases, the slope is negative and increasing (approaching zero). Using the 
slope would give better results particularly for the situation where there is very 
little separation between the two groups, and the density peak returns to zero 
very gradually. This can be clearly seen in Figure 3.1, where even among highly 
contaminated data, imposing a smallness condition on the slope results in a very 
accurate classification scheme. We estimate the slope through first differences, 
which we have usually found to be sufficiently smooth for our purposes, but 
we acknowledge that smoothing the slope estimate could improve performance. 
Occasionally, we did experience some difficulty with the slope being zero before 
the peak had ended, but this was actually due to non-smoothness of the esti- 
mated density curve itself. This problem was solved by imposing the additional 
requirement that the density curve be less than a certain percentage of its peak 
value; we found good results with 0.3. An alternative solution (which we didn't 
investigate) would be to smooth these first differences. Examination of several 
empirical density curves revealed that departure from monotonicity following 
the density's peak were not uncommon, and were sometimes longer than just 
a few points. We leave the investigation of smoothing the density's slope for 
future work, however. 

Computing a confidence interval on the slope requires knowledge of the true 
density, and, as previously stated, we cannot assume a distribution for the 
Mahalanobis distances. We therefore decided to empirically compare results 
from four half-interval lengths, centered around zero: (a) 0.001, (b) 0.0005, 
(c) 0.0001, and (d) 0.00005. The first value results in a higher sensitivity to 
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outliers, but also more false negatives, than the other values, and conversely for 
the fourth value. With the small number of simulations performed up to this 
point, it was impossible to tell which of these four values yielded the best overall 
success rates, so throughout most of our investigation we examined results from 
all four values. Only near the end of our study, when we had run thousands of 
simulations, did we feel comfortable in selecting (c) 0.0001 as yielding the overall 
best success rates in a wide variety of data situations. The first value resulted 
in an unacceptably high number of false negatives in many situations, with 
opposite results for the fourth value. It is worth noting that these cutoff values 
are not dependent on the scale of the data, because the Mahalanobis distance 
utilizes a covariance estimate and is accordingly scale invariant. This is one 
reason why it is essential to obtain an accurate covariance estimate. The best 
interval length will, however, depend on the acceptable error rates as described 
above. 

3.2.3 Computing the Sample Density 

There are several methods of computing the sample density. We followed 
the ideas of Silverman [77], and especially his use of the kernel density estimate. 
Let f{x) be the estimated density of / at x. We can calculate f{x) by 

1 " 

i=l 

where w{xi,x) is the weight contributed by the observation Xi at the point x, 
and w{xi,x) satisfies J^^w{xi,x)dx and w{xi,x) > 0\/x,y. It is also clear that 
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the smoothness properties of / will depend heavily upon this weighting function. 
The kernel density is obtained by letting the kernel function K{x,Xi,h) replace 
the weighting function and scaling appropriately: 



w{xi,x) = 




so that the kernel density estimate is given by 

i=l ^ ^ 

The required conditions on K are always satisfied if K is chosen to be 
a probability density, as is usually done. Specifically, these conditions are 
J^^K{x)dx — 1 and K{x) > OVx. (A symmetric K{x) is very useful, but not 
absolutely essential.) If K{x) is a probability density, so will f{x) and more- 
over, f{x) will inherit all the continuity and differentiability properties of K{x). 
Epanechnikov [22] showed that the choice of kernel is not crucial to the mean 
integrated square error of the estimated density, so computational efficiency is 
usually the deciding factor. The normal distribution is a common choice for 
K{x), which we utilize. Among other results, the normal density will ensure 
that K{x) is a smooth curve possesseing derivatives of all orders. The window 
width h determines the degree to which far-away observations influence the den- 
sity estimate at a particular point. It also helps to determine the smoothness of 
the resulting /. An overly small value of h will result in a non-smooth density 
estimate with individual bumps around each observation, tending towards a col- 
lection of Dirac delta functions as /i — > 0, while if h is too large, the resulting / 
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will contain only one very smooth peak over the mean of the data. In our con- 
text, choosing h too small will cause the density estimate to be overly influenced 
by random fluctuations and lead to a non-smooth slope estimate. Letting h be 
too large will cause the outliers to affect the density peak intended to described 
the inliers, and extend the density's return to zero, which is critical in small 
separation situations. For data containing outliers, Silverman [77] recommends 
a robust window width, 

h = 0.9An-^/^ 

where A — min{standard deviation, interquartile range/1.34}. Silverman [77] 
also discusses an adaptive window width which is narrower when the data are 
dense and wider otherwise, but we found satisfactory results with the flxed width 
kernel so we did not investigate this further. 

To our knowledge, the estimated density of robust Mahalanobis distances 
has not been previously used in outlier identiflcation, and we consider this to be 
an important contribution to outlier identiflcation. 

3.2.4 Modifications to the Original Proposal 

For most of this investigation, we also computed a similar estimate of loca- 
tion, namely the coordinate-wise location estimate x^w provided by the biweight 
function. After we were able to efficiently run thousands of simulations, however, 
we rejected this variation. The weighted mean gave slightly better results, in 
terms of producing more separation of outlier and inlier Mahalanobis distances, 
and produced overall error rates approximately 0.5% to 0.75% lower than the 
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coordinate-wise biweight location estimate. 

Another variation that we examined was what we refer to as the MINWT 
option. After computing the biweight weights and reassigning them to either 
zero or one, MINWT consists of setting aU p weights of an observation's com- 
ponents equal to the minimum weight across these components. That is, if at 
least one component receives a weight of zero, the other components of that 
observation also receive a weight of zero. The purpose here is to make abso- 
lutely sure that no outliers receive a weight of one and influence the sample 
mean/covariance calculation. Since the classical sample mean and covariance 
are very sensitive to outliers, they could easily be adversely affected if sufficient 
care isn't taken to ensure an outlier-free subset. If a particular observation is an 
outlier, some of its components might be slightly inside the region of acceptance, 
and receive weights above the cutoff value. These weights would be raised to one, 
influencing the robust estimates more than was intended. The MINWT option 
is certainly not very efficient and would only be feasible when there is sufficient 
data to discard questionable points. For certain difficult situations, this option 
did in fact lead to significantly better results; notably for heavy contamination 
(30% - 40%), and very little outlier separation. 

Initially, our cutoff value of 0.3 was somewhat arbitrarily chosen as a value 
that might provide an initial outlier- free subset. Subsequent analysis indicated 
that 0.3 might not be the best choice; better results were found using a higher 
cutoff. For the location estimate Xb^, define u{x) as a measure of the scaled 
distance from Xb^' 
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u — 



cS 



The biweight weighting scheme can then be written in the form 



w{u) = (1 -1*^)^. 



The K-value corresponding to w{u) = 0.3 is 




0.3 = 0.6725. 



For a cutoff of 0.3, if a point x satisfies \u{x) \ > 0.6725, then w{x) — 0, otherwise 
w{x) = 1 for < 0.6725. The effect of this reassignment is to change the 

smooth biweight weighting function to a step function with discontinuities at 
u — ±0.6725. We depict these two functions graphically in Figure 3.2. 

It can be seen from this figure that the area beneath the curve of the mod- 
ified weighting function is considerably larger than that beneath the biweight. 
Intuitively, it seems that this would result in the overall weights being too high, 
and allowing too many outliers to infiuence the robust estimates. It might be 
good for the area beneath the two curves to be at least reasonably similar so 
that when the weights are changed, certain properties are preserved, such as 
the mean weight w. It is not hard to calculate the cutoff value at which the 
two weighting schemes will have the same area under the curve. For ease of 
computation, we consider only u> 0. 
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Figure 3.2: Weighting function for the biweight and our proposed method 



so that if the area under the curve for the step function is to equal 0.53, the 
cutoff value needs to be 

w{0.53) = (1 - (0.53)2)2 ^ Q_52. 

Using a cutoff value of 0.52, we did in fact find increased success at outlier 
detection than with 0.3. We also continued the trend of examining higher cutoff 
values and found that outlier detection success continued to increase, with a 
markedly smaller increase in the inlier error rate . For instance, on a 10, 000 x 5 
data set with 8,000 observations from uncorrelated xi a-nd 2,000 observations 
from uncorrelated xfg, a cutoff value of 0.3 resulted in an outlier error rate of 
7.21% and an inlier error rate of 1.90%. The corresponding figures for 0.5 cutoff 
were 3.41% outher error and 1.63% inher error, while for 0.8 cutoff they were 
1.94% and 1.81%, respectively. Using a higher cutoff such as 0.8 would lead 
to overall lower weights being assigned, and more conservative location/scale 
estimates, yielding a higher sensitivity to outliers. We surmise that perhaps the 
benefits of the same area under the curve are not as important as other aspects 
of the algorithm. At this point, we leave the cutoff as a parameter yet to be 
ascertained, and return to it in the following chapter. 

A higher cutoff will reject more data, and it is certainly possible that the 
covariance estimate will be too small to give an accurate description of the 
inliers. We therefore investigated Winsorization of the covariance matrix. Win- 
sorization is a robust technique most often applied to the mean, and related to 
the trimmed mean. Given n ordered observations . . . ,?/(„), the symmetric 
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Winsorized mean is given by 



yw = -{g- y{g+l) + 1/(9+1) + y(g+2) + . . . + y{n-9) + g ■ V{n-g))- 

I b 

Winsorization has the property that it greatly lessens the influence of outhers 
by effectively moving them to the boundary of acceptability, but does not reject 
them completely. Dixon [18] notes that this procedure is called Winsorization in 
honor of Charles P. Winsor, who actively promoted its use. Winsorized variances 
are not as commonly used, however. 

Our situation is not as straightforward as computing the Winsorized vari- 
ance on a set of observations but the principles can still be applied. If 
an observation Xi is beyond the region of acceptability, it will be given a low 
weight Wi regardless of which side of the location estimate it is situated on. An 
observation with weight equal to (or very close to) the cutoff point will lie on 
the boundary of the region of acceptability; this is the point where we would 
like to start "Winsorizing" . We therefore search for the observation with the 
closest possible weight to the cutoff value (but not smaller) , and record its value 
Xyj. When computing the variance summation, we effectively replace all ob- 
servations with weight less than the cutoff point, with this recorded value x^j. 
Assume that the weight of Xy, is the k^^ order statistic among the sequence of 
n weights {wi} for that particular dimension, i.e. the weight of Xy, is the k^^ 
smallest for that dimension. The numerator is thus a sum of squared devia- 
tions: k ■ {xy, — xf -\- 'Y^=k+ii-'^i ~ ^Y-i where x is the location estimate for that 
dimension. 
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Determining a suitable divisor required further investigation. The divisor for 
the Winsorized mean is always taken to be n, but for the Winsorized variance, 
several authors [83], [85] use h = n — 2g, while [37] uses n(l — ^)^. [73] utilize 
h — n — 2g, but also state that using n would result in more conservative 
confidence intervals. A larger divisor (such as n) would result in a smaller 
covariance matrix, and increased sensitivity to outliers, a desirable property. We 
accordingly conducted a small experiment to examine the effect of this divisor 
on outher and inlier error rates; the results are shown in Table 3.4. For these 
comparisons, we ran 1,000 simulations on a 10, 000 x 5 data set, with inliers 
drawn from uncorrelated xt 20% outliers from uncorrelated Xib^ ^ well 
as the MINWT option. Note that for 20% contamination, a divisor of O.Sn 
corresponds to n — g. For this particular data situation, the best error rates 
can be obtained with a divisor larger than n. Employing a type of argument 
commonly used against overfitting, however, we are not sure this would be the 
best strategy for all data situations — especially for low contamination levels, a 
large divisor would create problems and cause too many inliers to be rejected. 
A divisor of n seems to be a good compromise; it is reasonably conservative by 
not allowing outliers to heavily influence the initial location/scale estimates. We 
henceforth use this value during Winsorization. 

It is also worth noting that we Winsorize only the diagonal elements of the 
covariance matrix, not the off-diagonal elements. Moving outlying observations 
an unspecified distance to the boundary of acceptability will clearly affect how 
the i^^ coordinate varies with the j^^ coordinate, Ylki^l ~ — ^^)- Thus, 
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Table 3.4: Examining the effect of tlie Winsorization divisor 



Divisor 


Outlier Error % 


Inlier Error % 


0.7// 


2.18 


1.70 


0.75n 


2.13 


1.73 


0.8n 


2.05 


1.78 


0.85n 


2.00 


1.82 


0.9n 


1.94 


1.86 


0.95n 


1.88 


1.90 


l.On 


1.81 


1.95 


1.05n 


1.77 


1.99 


l.ln 


1.73 


2.02 


1.15n 


1.68 


2.07 


1.2n 


1.64 


2.11 


1.25n 


1.60 


2.15 


1.3n 


1.57 


2.18 


1.35n 


1.53 


2.22 


1.4n 


1.49 


2.27 


1.45n 


1.46 


2.30 


1.5n 


1.42 


2.34 


1.6n 


1.36 


2.42 


1.7n 


1.30 


2.51 


2.0n 


1.15 


2.73 
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the only values we use in computing the off-diagonal elements are those with 
weight equal to one. 

Generally, we found that Winsorizing the variance gave better results. We 
present some examples of covariance matrices obtained with and without Win- 
sorization. Consider a 10, 000 x 5 data set drawn from a mixture of uncorrelated 
Xg and Xi5) a situation where observations from the two groups overlap. We 
consider contamination levels of 5%, 10% and 20%. Without Winsorization, the 
covariance matrix of the observations in the initial subset, using a cutoff of 0.7 
with 5% contamination, was estimated as 



5.48 


-0.08 


0.09 


0.03 


0.003 


-0.08 


5.28 


-0.0004 


0.01 


-0.01 


0.09 


-0.0004 


5.31 


-0.05 


0.06 


0.03 


0.01 


-0.05 


5.52 


0.08 


0.003 


-0.01 


0.05 


0.08 


5.21 



Recall that for xh theoretical variance is 10. Employing Winsorization and 
with the rest of the parameters kept constant, the diagonal elements were 

{8.40 8.19 8.23 8.45 7.96}. 

(The off-diagonal elements were the same since we only Winsorize the variance 
estimates.) For 10% contamination and without Winsorization, the estimated 
diagonal elements were 

{6.21 6.13 6.24 6.40 5.88}, 
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and with Winsorization 

{10.23 10.06 10.11 10.45 9.72}. 

Finally, for 20% contamination, the diagonal elements without Winsorization 

were estimated to be 

{9.31 9.13 9.19 9.34 9.23}, 
and with Winsorization as 

{16.21 16.03 16.06 16.42 15.94}. 

Winsorization increases the variance because to some extent, it still considers the 
outliers. We computed the error rates from the above situations, based on 1,000 
simulations and no MINWT, and present the results in Table 3.5. As before, we 
utilize the format out.err%/in.err% to denote the outlier and inlier percentage 
errors. It is not possible to draw any conclusions from this table. Based on 
the actual covariance estimates, however, it seems likely that Winsorization 
could be beneficial for low contamination levels, but as the variance estimate is 
inflated with increasing percentage of outliers, augmenting the variance through 
Winsorization becomes undesirable. 

Another variation which was we tried was wpct. Instead of a fixed cutoff, 
select the observations for the initial subset according to the upper wpct per- 
centage of weights, where wpct is determined in advance. Not surprisingly, this 
method produced great results if wpct was close to the percentage of actual in- 
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Table 3.5: Examining the effect of Winsorization 



Contamination Level 


Without Winsorization 


With Winsorization 


5% 


0.55/4.63 


0.75/3.88 


10% 


0.91/3.77 


1.19/3.08 


20% 


4.55/2.64 


4.31/2.00 



liers in the data, but faltered if there was a big discrepancy. Sometimes general 
characteristics of the type of data being investigated can lead the researcher to 
develop an approximate estimate of the contamination level. For instance, many 
data sets usually contain less than 5% outliers. Several outlier methods require 
the number of outlier to be specified in advance, such as Grubbs test for k out- 
liers in a specified tail [29], [37], so employing wpct would put our procedure in 
this class of algorithms, wpct is a good alternative to other algorithms in this 
class because it does not require the exact amount of outliers to be specified in 
advance. It will still yield good results as long as the two percentages are fairly 
close, and demonstrates a relatively gradual drop in performance with increas- 
ing discrepancy. For an unknown percentage of outliers, however, a fixed cutoff 
value is to be preferred. 

This algorithm is computationally appealing, as will be analyzed in more 
detail on page 141. Since each dimension is analyzed separately, computational 
burden grows relatively slowly with increasing dimension. This is in marked con- 
trast to most other methods, such as the MCD, Kurtosisl and even MULTOUT 
(which depends on the MCD in its initial stage) , where increasing p leads to sig- 
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nificant increase in computational time. Fitting the actual biweight function is 
not computationally intense, and once the weights have been assigned, calculat- 
ing the sample mean and covariance matrix are the fastest possible methods of 
obtaining meaningful location and scale estimates. At this stage, estimation of 
the sample density is the most time-consuming step of the algorithm, but as we 
discuss on page 75, faster methods exist for this step, which lead to considerable 
improvement. 

3.3 A Summary of Our Proposed Method 

Before proceeding to the next chapter, we summarize our proposed method. 

1. Calculate the biweight weights for each data point using the algorithm 
described on page 48, fitting each component separately. 

2. If an observation's weight is less than the cutoff value, reassign that weight 
to zero, otherwise reassign it to one. The optimal cutoff value still needs 
to be investigated. 

3. Option MINWT: Select the minimum weight from each observation's com- 
ponents as the weight for the other p—1 components too. The effectiveness 
of this option still needs to be determined. The order of this step can also 
be interchanged with the previous step. 

4. Compute the weighted mean and covariance of all the observations, which 
amounts to the sample mean and covariance of those observations in the 
initial subset. The effect of Winsorizing the variance still needs to be 
investigated. 
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5. Calculate a robust Mahalanobis distance for all n observations, using the 
mean and covariance matrix computed in the previous step. 

6. Calculate the density of these Mahalanobis distances, using the Normal 
density as the kernel and window width as described on page 60, and use 
first differences to approximate the slope. 

7. There will be a peak describing the inliers, where the slope has a large 
positive value followed by a large negative value and increases towards 
zero. Find the value of Mahalanobis distance where this negative slope 
first enters a confidence interval containing zero. A good value for the 
half-length of this interval is 0.0001. Let the value of Mahalanobis distance 
where the slope is sufficiently close to zero (and where the density is also 
less than 0.3 of its maximum value), be called the rejection point. 

8. Classify all points as inliers or outliers based on whether their Mahalanobis 
distance is smaller or larger than the rejection point. 
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4. Results from the Proposed Method 

4.1 Computational Details 

Having developed an algorithm for identifying outliers, it is necessary to 
evaluate its performance over a wide range of data types. We also need to ex- 
amine its computational feasibility on larger data sets, the intended primary 
application. So far we have tested our method in Splus, which is very slow, 
and particularly inept at handling large loops. We therefore decided to imple- 
ment it as a Fortran program, which could be used to rapidly run thousands 
of simulations for diverse data types and different parameter values. Fortran 
was selected because of the abundance of existing mathematical and statistical 
subroutines, some of which utilized complex numerical techniques and had been 
written to be computationally efficient. The specific version used was Fortran 
90, a considerable improvement over Fortran 77. 1 did not make a full study 
of all the improvements offered by F90, and there are almost certainly several 
of my routines that could be optimized for F90. I am also aware of the devel- 
opment of F95, but decided to designate this upgrade as a future project. F95 
is an object oriented language and I have had only limited experience with the 
object oriented approach. 

A very important issue that arises in these types of numerical investigations 
is the randomness of the pseudo-random number generator (PRNG). For the 
programs presented in this thesis, we utilize the random number generators 
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described in the second volume of Numerical Recipes for Fortran [62, ch. B7], 
updated and improved for F90. The period of this random number generator is 
2.3 X 10^*, which more than suffices for our applications. The PRNG presented 
in the second volume of Numerical Recipes [62] is a faster and more efficient 
version of the author's ran2 routine in the first volume [61]. In the first volume 
[61], the authors claim that within the limits of floating point precision, ran2 
provides "perfect" random numbers. As a practical definition of perfect, they 
offered $1,000 to the first reader who could prove that ran2 fails to generate 
random numbers, by means of a statistical test or otherwise. Ten years later 
when the second volume [62] was published, the authors stated that the award 
had not yet been claimed. This PRNG produces sequences of random C/(0, 1) 
numbers, from which appropriate transformations can be made to obtain random 
data from other distributions. Some of the transformation routines are also 
provided in the first volume of Numerical Recipes [61, ch. 7]. For instance, 
to obtain normally distributed random numbers from ?7(0, 1) deviates, we use 
the Box-MuUer transformation (described in [62], with corresponding Fortran 
code), xl/ deviates can be easily obtained from Gamma deviates, since the Chi- 
squared distribution is a special case of the Gamma distribution, but in general, 
obtaining random Gamma deviates is a nontrivial problem. To generate these 
deviates, we utilize Fortran routines posted at ranlib [63] , a division of the Netlib 
repository, which are based on a modified rejection technique described in [2]. 
Based on these algorithms, we therefore assume that the data generated for the 
purpose of this thesis are sufficiently random. 
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One aspect of our algorithm which necessitated some change, was the density 
calculation. The method described in the previous chapter utilizes an inefficient, 
complete double summation. For each point at which we wish to estimate the 
density, we have to compute the influence of all n Mahalanobis distances, even 
though many of these distances are too far removed to have a measurable effect 
on the density at that particular point.) In Apphed Statistics algorithm AS 176, 
Silverman [76] describes and provides Fortran code for a routine based on the 
Fast Fourier Transform which is significantly faster than the complete double 
summation. Part of his algorithm involves building a histogram of the data with 
2*^ cells (we use k — 9), and then applying the FFT to both the data and the 
kernel K{x). (Recall that we used a Gaussian kernel.) We found that utihzing 
this procedure reduced the computation time by approximately a factor of 10. 

Another issue that arose during the practical implementation of our algo- 
rithm was finding the desired rejection point, specifically concerning smooth- 
ness of the density estimate. The first point where the slope enters the interval 
[-0.0001,0.0001] is not always at the end of the main density peak, due to small 
numerical inaccuracies in the estimated density. Sometimes the estimated slope 
assumed very small negative values within the first few data points. We could 
easily solve this problem by requiring the rejection point to be greater than half 
the mean Mahalanobis distance. A second problem arose because the density 
peak did not always return to zero in a smooth curve — sometimes there were 
short plateaus, or other deviations from what should be a monotonic decreasing 
segment. At times, the lengths of these plateaus were such that that removing 
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them through a smoothing process would require the smoothing window to be 
sufficiently large as to have other undesirable effects on the estimated density, 
such as increase in mean integrated square error. An alternative (and faster) 
solution would be to constrain the rejection point to regions where the density 
was less than a certain fraction of its maximum peak value. After some inves- 
tigation, we settled on a value of 0.3 for this fraction. For the majority of data 
situations, the algorithm's performance was not sensitive to this value of 0.3, 
but for high contamination (35% - 40%) the density curve returned to zero very 
slowly and raising this value to 0.4 did produce slightly better results. 

A few other miscellaneous notes regarding our program's computational 
specifics are in order. We use double precision in all of our programs to guard 
against round-off error. To calculate the median (used for both the biweight 
algorithm and BACON), we employ a version of Quicksort which sorts only as 
much data as necessary to locate the median; specifically, a subroutine written by 
Alan Miller [55]. For the interquartile range (to determine the optimal window 
width for density estimation), I used the function selection from Numerical 
Recipes [61, p. 334], which calculates the k*^ smallest value in an array. Finally, 
to calculate the inverse of the PDS covariancc matrix for Mahalanobis distance 
calculation, I used the subroutine dpodi, from the Unpack [49] repository of linear 
algebra public domain software, another subdivision of Netlib. 

Naturally, we want to compare our results with the best existing outlier 
methods. None of the other algorithms we examined up to this point were in a 
form suitable for rapid multiple iterations, so we wrote Fortran code for these 



76 



algorithms too, in order to run thousands of iterations and effectively compare 
performances. BACON has usually produced good results up to this point, so it 
was a natural choice for further investigation. The executable version provided 
by the author ran on an MS-DOS platform; since it was not possible to view 
or modify the source code, I wrote a Fortran program based directly on the 
article from Hadi et al. [32], which will henceforth be referred to as BACON* 
to distinguish it from the author' original implementation. I again utilized some 
of the linear algebra routines from Unpack [49] , and additionally also a collection 
of routines from cdflib [10] to calculate the inverse percentiles. The latter 
routines were written by Brown and Lovato (MD Anderson Cancer Center), 
and based on an algorithm by DiDonato and Morris [17] as well as tables from 
Abramowitz and Stegun [1]. 

Another candidate that has performed well up this point is Kurtosisl. It 
does not appear to be perturbed by difficult data types such as concentrated 
outlier clusters or non-Normal data. It also is very appealing in that it rep- 
resents an innovative class of outlier algorithms, based on improved projection 
pursuit methods, and redesigned so as to be significantly more computation- 
ally feasible. The original MATLAB code was clearly not suitable for our use 
since it restricted the data set to n — 750, while the MATLAB platform was 
also very slow and not intended for these type of simulations. I therefore wrote 
Fortran code based directly on the MATLAB program, which will henceforth 
be referred to as Kurtosisl*. In addition to the routines already mentioned, 
Kurtosisl* also necessitated a large number of other hnear algebra routines 
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such as Cholesky decomposition, simultaneous solution of systems of equations 
(including also special cases for under /over-determined systems of equations), 
eigenvalue/eigenvector calculation (for several different matrix types), QR fac- 
torization and LQ factorization. I was able to obtain Fortran subroutines for 
these procedures from lapack [47] and eispack [21], both special sections of the 
Netlib repository. Although the portion of the code that 1 wrote was only about 
1,500 lines, after adding all the required subroutines, the total program length 
was nearly 18,000 hues. I am indeed very grateful to the authors of these routines 
for making them available on the public domain. 

We did not examine the MULTOUT algorithm during the large-scale sim- 
ulation stage. Preliminary results seemed to indicate that it did not perform 
as well on non-standard data types such as outlier clusters and possibly also in 
higher dimensions such as p = 20. Juan and Prieto [44] conducted a simula- 
tion experiment using Normal data with small separations, comparing the MCD, 
MULTOUT and other methods, in which MULTOUT demonstrated significantly 
decreased performance (100% failure rate in detecting outliers) in higher dimen- 
sions (p= 10,20). 

We have not examined MULTOUT sufficiently to draw definite conclusions, 
but we do note that decreased performance in higher dimensions could be a 
result of utihzing the MCD for the initial stage. The MCD does not handle 
increased dimension very well; subdividing the data into smaller cells reduces n 
but not p. Though it did not influence our decision, we also note that the C 
implementation of MULTOUT took very long to execute, even longer than our 
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Fortran version of Kurtosisl* (which took a very long time, as will be discussed 
later). The other algorithm we previously examined, the MCD, is for reasons 
already discussed not a viable option for use in large data sets, and also fared the 
worst of the 4 methods we examined. Thus we decided to compare our method 
to only BACON* and Kurtosisl*. 

The choice of data is also important. Most outlier algorithms in the liter- 
ature are tested on Normally distributed data, even though there is no reason 
to believe that the majority of applications and real world data sets satisfy this 
assumption. When these algorithms are tested on real data sets, these sets are 
typically of small n and p. We felt that the distribution offered a good al- 
ternative to the normal, due to its assymmetry and longer tails. For instance, 
when mixing xl with X15 or xlo^ there is significant overlap between the two 
groups and the algorithms could be tested under extreme circumstances. 

4.2 Choice of Pcirameters for our Method 

With the limited investigations performed up to this point, it has not been 
possible to determine the best value for the cutoff parameter ctf, or the success 
of Winsorization and MINWT. We therefore ran 1,000 simulations with cutoff 
values from 0.5 to 0.9, and the permutation set of Winsorization crossed with 
MINWT. We used a data set of size 10, 000 x 5, with 8,000 observations drawn 
from uncorrelated xl ^-nd 2,000 from uncorrelated Xib- This resulted in some, 
but not excessive, outlier overlap, and a moderate contamination level. The re- 
sults are presented in Table 4.1, utilizing the format outlier, error %/inlier.error% 
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Table 4.1: Examining the effect of cutoff, Winsorization and MINWT 





VV lllo. 


VV lllo. 


iNU VV lllo. 


iNU VV lllo. 


off 


MINWT 

iVJ-Xl 1 V V J- 


No MINWT 


MINWT 


No MINWT 


0.5 


3.34/1.79 


8.98/2.24 


3.04/2.37 


11.78/2.78 


0.55 


2.96/1.76 


7.57/2.16 


2.49/2.34 


9.45/2.85 


0.6 


2.68/1.75 


6.27/2.11 


2.13/2.31 


7.58/2.74 


0.65 


2.50/1.74 


5.24/2.03 


1.86/2.32 


6.00/2.66 


0.7 


2.30/1.77 


4.36/1.97 


1.64/2.36 


4.63/2.60 


0.75 


2.13/1.80 


3.61/1.92 


1.48/2.40 


3.49/2.55 


0.8 


1.98/1.84 


2.93/1.90 


1.31/2.50 


2.54/2.57 


0.85 


1.81/1.95 


2.48/1.92 


1.16/2.71 


1.92/2.69 


0.9 


1.79/2.13 


2.27/2.04 


1.05/3.34 


1.46/3.13 


0.95 


1.74/2.58 


1.98/2.49 


0.76/10.00 


0.89/10.12 
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discussed in Table 3.1 on p. 43. 

We notice from Table 4.1 that as the cutoff increases, the outlier error rate 
decreases monotically while the inlier error rate follows a quadratic curve; we 
indicate the minimum of this quadratic in bold type. For increased cutoff, the 
variance estimate shrinks because a greater portion of the outlying points are 
assigned a weight of zero and are not included in the estimate. A smaller, more 
accurate variance estimate causes distant points to receive large robust Maha- 
lanobis distances, with a greater possibihty of being labelled as outliers. Since 
the outliers clearly do lie farther away, a smaller variance estimate results in 
more outliers possessing large robust Mahalanobis distances, a greater separa- 
tion from the inliers, and an increased likelihood of being correctly labeled as 
outliers. Hence the outlier error rate decreases with increasing cutoff. 

Investigation of the quadratic trend of the inlier error rates showed that for 
lower cutoff values, the off-diagonal elements of the covariance estimate were 
considerably larger than zero, e.g. for a cutoff of 0.5, the off-diagonal elements 
were on the order of 2.5. This may be due to unintended correlations between 
the inliers and outliers. As the cutoff increases, fewer outliers are assigned a 
weight of one, and the off-diagonal elements decrease to zero. The improved 
estimates of these off-diagonal elements result in lower robust Mahalanobis dis- 
tances for the inliers, and thus fewer of them are incorrectly rejected. The error 
rate therefore decreases. When these off-diagonal elements are sufficiently close 
to zero, however, improving them no longer leads to lower inlier error rates, 
which subsequently start increasing. In contrast to the outlier error rate, a 



81 



smaller variance estimate leads to higher inlier error rates because some of the 
inhers receive large robust Mahalanobis distances, and are incorrectly labelled 
as outliers. When we artificially force the off-diagonal covariance elements to 
be zero, the inlier error rate increases monotically with the cutoff. These two 
opposite trends highlight the difficulty in obtaining a uniformly best cutoff value. 

It appears that the combination of No Winsorization and No MINWT does 
not lead to as good results as the other 3 combinations, so we do not consider it 
further. A possible explanation for this is that without any kind of adjustment, 
20% outliers is a sufficiently heavy contamination level to adversely affect the 
covariance estimate, and leads to an inadequate separation between inlier and 
outher Mahalanobis distances. We also selected 2 values of the cutoff to in- 
vestigate at each of the remaining three factor combinations, placing somewhat 
more weight on lower outlier error, while still maintaining reasonable inlier er- 
rors. For Winsorization and No MINWT, we investigate cutoff values of 0.85 
and 0.9, while for MINWT, and both levels of Winsorization, we investigate 
cutoff values of 0.8 and 0.85. 

We examine dimensions p = 5, 10, 20 and contamination levels e = 5%, 10%, 
20%, 30%, 35%, 40%. e = 35% was examined because the algorithm sometimes 
demonstrated a marked increase in error rate between e = 30% and e = 40%; 
adding this value showed that the sudden increase usually occurred between e = 
35% and e = 40%. We arc not aware of significant outlier research or testing on 
non-normal data, so we investigate a mixture of distributions as an example 
of highly skewed data. It is still important, however, to verify performance of our 
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method on normally distributed data, so we also investigate two types of shift 
outhers, the most difficult type of Gaussian outliers /citerocke96. We are also not 
aware of significant outlier research on correlated data, so the third category of 
outlier distributions we examine is a mixture of correlated multivariate normals. 
The multivariate normal distribution was chosen as an example of correlated 
data because it will be easy to compare results directly with the uncorrelated 
shift outhers. 

The computational cost of running these simulations is reasonable, so we 
follow a complete factorial experimental design [9]. We ran 1,000 simulations 
for our proposed method, as well as for BACON*, but only 10 iterations for 
Kurtosisl*, which still took roughly 3-5 times longer than 1,000 iterations with 
our method. Wc will conduct a more complete investigation of computational 
time in section 4.7, but at this stage, it is important to note that the reported 
results for Kurtosisl* do not have the same accuracy as the other methods. 

A simple geometric argument reveals that for a given inlier distribution, 
if the same outlier distribution is used for p — 5, p — 10 and p — 20, the 
average distance to the outliers will increase with increased dimension, and an 
algorithm's success rate will increase with dimension, provided it is capable of 
analyzing large dimensions (the MVE and MCD are examples of algorithms 
that do not satisfy the latter condition.) For instance, the one-dimensional 
distance from the origin to a point 5 units away is clearly 5, while the two- 
dimensional distance from the origin to (5,5) is -\/5^~+~5^ = 7.07. To compare 
the effect of dimension on outher detection sucess, we therefore need to scale the 
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distance to the outlier mean so that it remains approximately constant for all 
dimensions. The Euclidean distance could be used to calculate the appropriate 
scale factors for uncorrelated data with the same variance in each dimension, 
but for a general data set, the Mahalanobis distance between two populations 
is a more appropriate metric [75, p. 36] that takes into account not only the 
variance of the data, but also the correlations between components. For a given 
separation between the inlier and outlier distribution means, which we will call 
S, the expected value of the Mahalanobis distance needs to remain constant 
regardless of dimension: 

where is the mean of the inlier distribution, /Lt^^ the mean of the outlier 
distribution, and S is a convenient scale matrix [51, p. 31], which we take 
to be the covariance matrix of the inlier distribution. Setting (ci, . . . , o?)^ = 
-^[(Min ~ l^miti^ 1 i-®- ^ is one-dimensional difference between and /Ug^^, 
we can evaluate the expected value of the Mahalanobis distance as 

where c*-' are the elements of the inverse covariance matrix For a given 

value of 5^, we therefore determine |x^^ such that d = a/^Y"^^^- For 
the simulations in the following sections, we will examine several values of 5^, 
and among others, note how rapidly success at outlier detection improves with 
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increasing S^. 

4.3 Skewed Data: A Mixture of Distributions 

For the simulations in this section, we always generate inliers as uncorrelated 
Xl deviates. For p = 5, we generate outhers from x?o> Xi3' Xi5' xio' xls- 
Since S is a diagonal matrix with lO's on the diagonal and O's elsewhere, the 
separation level for xlo outliers corresponds to 5^ = 5^ • 0.5 = 12.5, with the 
other respective separation levels being S"^ — 32, 50, 112.5 and 200. 

4.3.1 A Mixture of x^'s with 5^ = 12.5 

In this section, we compare simulation results where the distributions of 
outliers and inliers overlap considerably. At p = 5, outliers are drawn from Xioj 
while at p = 10, we compute d as d = ^/WT^T/^ = ^12.5/(10 ■ 0.1) = 
3.54, so fig^f = 8.54 • 1. Similarly, at p = 20 and Xio^ we compute d = 
v/12.5/(20-0.1) = 2.5, so n^, = 7.5 • 1. 

A sample scatterplot of 10,000 observations is shown in Figure 4.1, where 
the first 8,000 observations are drawn from xl ^^^^ the last 2,000 from Xio- 
Also shown in this figure are Mahalanobis distances calculated by our method 
from uncorrelated Xio outliers in p = 5 dimensions, together with the rejection 
boundary used to classify points as outliers or inliers. The data in the scatterplot 
constitute the first dimension of the data set on which the Mahalanobis distances 
are calculated, with the data in the remaining 4 dimensions generated in a 
similar fashion . Specific parameters used for this calculation were MINWT, No 
Winsorization, and a cutoff value of 0.85. It is clear that this is a very difficult 
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situation for any method to correctly identify the outhers, and thus the error 
rates are higher than what we would hope for in actual practice. The rejection 
value shown here is 19.03, leading to 523/2000 = 26.15% outlier error rate and 
563/8000 = 7.04% inlier error rate. 

We first examine results for p = 5, in Tables 4.2 and 4.3. Note that the 
third row of the column headings indicates the cutoff value — either 0.8, 0.85 or 
0.9. Space considerations lead us to display these results in two tables: the first 
table containing the first four combinations of Winsorization and MINWT, and 
the second table containing the remaining two combinations as well as BACON* 
and Kurtosisl*. We continue to use the same format for inlier and outlier error 
percentages: outlier, error % /inlier. error%. Sometimes an algorithm will fail to 
detect all or nearly of the outliers in a data set. Rather than give the actual error 
rates in these cases, we simply indicate that the algorithm failed by a dash: — . 
The boundary chosen for this definition of "failure" is 90% (chosen somewhat 
arbitrarily); we could also have selected 80%, because an algorithm that fails to 
detect 80% of the outliers is essentially useless in practice. For the purposes of 
examining error rates, however, we felt that a boundary of 80% would discard 
too much data. Practically, there is hardly any difference between an algorithm 
that fails to detect 91% of the outliers versus one that fails to detect 97%. 
We never encountered an inlier error rate greater than 90%, so a dash always 
indicates that the outlier rate was too high to be meaningful. 

We notice that BACON* in particular does not seem able to handle this 
outlier situation, although all the methods experience some degree of difficulty. 
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Figure 4.1: Outliers from xlo with Mahalanobis distances 
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Table 4.2: Simulation results for Xio outliers at p = 5, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


26.15/6.23 


24.36/6.99 


24.01/6.72 


23.15/7.08 


10% 


30.75/5.31 


28.87/5.87 


27.96/5.62 


26.75/6.01 


20% 


43.47/3.64 


40.57/4.04 


37.82/3.91 


36.56/4.11 


30% 


60.59/4.04 


55.27/2.67 


52.91/2.46 


50.07/2.70 


35% 


68.79/1.71 


63.57/2.05 


64.06/1.69 


60.86/1.90 


40% 


75.28/1.35 


70.80/1.57 


73.55/1.22 


70.25/1.39 



Table 4.3: Simulation results for xlo outliers at p = 5, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


22.38/7.33 


20.29/8.31 




27.83/6.50 


10% 


25.59/6.36 


23.56/7.14 




34.00/5.70 


20% 


34.62/4.71 


31.86/5.34 




56.65/3.68 


30% 


48.89/3.31 


44.28/3.95 




74.72/2.27 


35% 


59.44/2.52 


54.42/3.09 




79.45/1.83 


40% 


69.91/1.83 


64.78/2.38 




83.47/1.47 
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Table 4.4: Simulation results for xl.54 outliers a,t p — 10, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


28.69/6.88 


26.20/7.93 


21.59/9.25 


20.95/9.71 


10% 


33.97/6.02 


30.73/6.94 


24.76/8.05 


24.12/8.46 


20% 


45.38/4.71 


40.48/5.56 


31.17/6.47 


30.24/6.85 


30% 


59.74/3.26 


52.64/4.15 


40.18/5.16 


38.76/5.55 


35% 


67.04/2.55 


59.76/3.32 


47.09/4.36 


45.37/4.77 


40% 


73.03/2.04 


66.24/2.67 


58.34/3.04 


55.92/3.50 



Table 4.5: Simulation results for x| 54 outliers at p = 10, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


22.02/9.04 


19.12/10.80 




34.80/6.94 


10% 


24.39/8.20 


21.69/9.69 




41.05/6.28 


20% 


29.26/7.17 


26.07/8.60 




60.01/4.50 


30% 


36.75/6.23 


33.48/7.44 




73.48/3.28 


35% 


43.70/5.29 


40.53/6.29 




77.54/2.76 


40% 


54.97/3.79 


51.53/4.70 




80.85/2.27 
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The results for p = 10 are shown in Tables 4.4 and 4.5. It is interesting to 
note that error rates for our method are approximately the same at the lower 
contamination levels compared to p = 5, but at the higher contamination levels, 
the outlier error rates are definitely lower than at p = 5, while the inlier error 
rates are shghtly higher. With only 10 simulations for Kurtosisl*, the changes 
in error rates fall within the uncertainty interval. 

The results for p — 20 are shown in Tables 4.6 and 4.7. It appears that with 
significant outlier overlap on skewed data, our method performs considerably 
better than either BACON* or Kurtosisl*. For the same value of 5^, it does not 
seem to be adversely affected by higher dimensions (p — 10 or 20). 

4.3.2 A Mixture of x^'s with 5^ = 32 

Analogous to the previous section, we now present simulation results where 
the outliers are drawn from an uncorrelated distribution such that — 32. At 
p — 5, this corresponds to drawing outliers from Xis, at p = 10 from Xio.66' 
at p = 20 from Xg- In Figure 4.2 we plot the first dimension of a 10, 000 x 5 data 
set, together with Mahalanobis distances, calculated with the same algorithm 
parameters as before. For this data set, the rejection value is 28.35, shown as 
a horizontal line in the figure, which leads to 69/2000 = 3.45% error among 
outhers and 420/8000 = 5.25% error among inliers. 

The results for p = 5 are shown in tables 4.8 and 4.9. Compared to outhers 
from Xio, there is a big decrease in the error rates for our method and Kurtosisl*. 
for a relatively small increase in separation between the two groups. BACON* 
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Table 4.6: Simulation results for X7.5 outliers at p = 20, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


28.02/8.78 


24.70/10.37 


19.83/12.45 


22.38/13.60 


10% 


32.50/8.13 


28.58/9.54 


22.00/11.25 


24.52/12.11 


20% 


42.96/6.56 


37.93/7.63 


29.02/8.40 


31.13/8.93 


30% 


56.67/4.47 


51.00/5.23 


43.25/4.80 


44.07/5.46 


35% 


62.93/3.64 


57.51/4.20 


53.05/3.31 


52.86/4.01 


40% 


69.25/2.85 


63.91/3.34 


62.31/2.36 


61.33/3.08 



Table 4.7: Simulation results for Xio outliers at p — 20, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


20.52/11.80 


23.18/13.52 




40.80/7.47 


10% 


22.50/10.81 


24.85/12.15 




47.95/6.64 


20% 


28.96/8.34 


32.03/9.05 




63.25/5.06 


30% 


43.10/4.80 


45.24/5.67 




72.89/4.00 


35% 


51.98/3.30 


53.87/4.30 




76.08/3.56 


40% 


62.28/2.35 


62.08/3.45 




78.95/3.01 
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Figure 4.2: Outliers from xls with Mahalanobis distances 
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Table 4.8: Simulation results for xls outliers at p = 5, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


3.72/4.86 


3.42/5.34 


3.35/5.11 


3.15/5.44 


10% 


5.13/3.92 


4.67/4.33 


4.58/4.03 


4.29/4.29 


20% 


10.14/2.75 


9.23/2.93 


7.50/2.73 


7.22/2.83 


30% 


24.12/2.18 


20.95/2.20 


12.12/2.29 


10.96/2.73 


35% 


46.37/1.35 


35.57/1.63 


16.92/2.36 


13.80/2.34 


40% 


72.22/0.57 


61.69/0.80 


33.83/1.97 


20.12/2.40 



Table 4.9: Simulation results for xls outliers at p = 5, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


2.89/5.71 


2.55/6.23 


79.08/0.10 


2.54/7.03 


10% 


3.82/4.66 


3.38/5.18 




2.46/6.95 


20% 


6.00/3.42 


5.46/3.69 




2.62/6.60 


30% 


10.70/3.23 


8.80/3.30 




3.29/5.76 


35% 


17.36/3.43 


12.75/3.55 




21.05/4.21 


40% 


39.81/2.47 


22.67/3.60 




70.99/1.36 
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Table 4.10: Simulation results for Xio.ee outliers ai p — 10, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


4.54/5.11 


3.98/5.79 


2.92/6.48 


2.84/6.73 


10% 


6.59/4.44 


5.69/4.95 


3.99/5.23 


3.79/5.58 


20% 


13.54/3.88 


11.25/4.17 


6.29/3.90 


6.01/4.21 


30% 


28.13/3.43 


22.24/3.80 


8.04/3.68 


7.98/3.81 


35% 


44.03/2.43 


31.92/3.28 


9.26/3.88 


9.32/3.90 


40% 


64.29/1.31 


47.91/2.23 


11.08/4.42 


11.46/4.31 



Table 4.11: Simulation results for Xiom outliers at p = 10, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


2.89/6.53 


2.41/7.58 


81.81/0.14 


3.48/7.79 


10% 


3.61/5.67 


3.10/6.52 




3.71/7.40 


20% 


5.09/4.76 


4.48/5.47 




5.02/6.72 


30% 


6.56/4.72 


6.08/5.16 




58.21/2.95 


35% 


7.46/5.44 


7.92/5.11 




66.31/2.19 


40% 


10.20/5.96 


9.96/6.25 




72.78/1.67 
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is still not able to detect the majority of outliers. 

Numerical results for p — 10 are presented in Tables 4.10 and 4.11. For 
the best parameter combinations (No Winsorization and MINWT), our method 
produces lower outlier as well as inlier error rates than Kurtosisl*. This differ- 
ence is especially marked at the higher contamination levels, where our method 
shows significantly superior performance. 

Results for p — 20 are displayed in Tables 4.12 and 4.13. Our algorithm 
continues to demonstrate markedly better success at outlier detection than both 
BACON* and Kurtosisl*. We can also start seeing some trends concerning our 
method's parameter combinations. It seems that for the kind of high overlap 
outhers encountered with 6'^ — 12.5 and 6'^ — 32 and lower contamination levels, 
say 20% and below, the best results are obtained for MINWT and No Winsoriza- 
tion. At the lower contamination levels, the lower cutoff of 0.8 usually produces 
an inlier error rate of a few percent lower than a cutoff of 0.85, with less than 
one percent increase in outlier error rate. It will be interesting to see if these 
trends continue for the next separation level. 

4.3.3 A Mixture of x^'s with = 50 

The outliers are now further away than before, but there is still some overlap, 
as we illustrate with Xi^ outliers in Figure 4.3. The rejection value for this data 
set in p = 5 dimensions is 35.62, leading to 24/2000 = 1.2% outher error rate 
and 184/8000 = 2.3% inlier error rate. A separation level of 5^ = 50 leads to 
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Table 4.12: Simulation results for Xg outliers at p — 20, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


5.01/6.50 


4.07/7.62 


1.92/11.15 


2.46/12.81 


10% 


7.74/6.05 


5.98/7.04 


2.48/9.66 


2.84/10.75 


20% 


14.70/6.16 


11.22/6.80 


3.67/7.75 


4.02/8.26 


30% 


28.28/4.88 


19.88/6.25 


4.66/7.17 


5.32/7.42 


35% 


45.52/2.67 


34.68/3.54 


5.21/7.17 


6.38/7.15 


40% 


59.90/1.62 


51.12/1.93 


7.74/5.97 


12.78/5.02 



Table 4.13: Simulation results for xl outliers at p = 20, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


2.45/9.39 


2.35/13.50 


84.18/0.16 


5.58/7.90 


10% 


2.94/8.53 


2.52/12.15 




5.87/7.71 


20% 


3.72/7.70 


3.18/10.39 




16.75/6.39 


30% 


4.24/7.79 


4.47/9.20 




55.05/3.57 


35% 


4.92/7.63 


6.89/7.55 




61.44/3.01 


40% 


8.56/5.73 


19.28/4.08 




67.28/2.31 
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Figure 4.3: Outliers from X15 with Mahalanobis distances 
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Table 4.14: Simulation results for xlb outliers at p = 5, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.57/4.36 


0.52/4.77 


0.54/4.39 


0.50/4.72 


10% 


0.86/3.38 


0.78/3.70 


0.84/3.38 


0.78/3.55 


20% 


2.49/1.92 


2.27/2.04 


1.98/1.83 


1.82/1.95 


30% 


8.73/1.83 


7.37/1.66 


4.07/1.36 


3.61/1.30 


35% 


17.97/2.01 


13.85/1.84 


6.18/1.69 


4.76/1.40 


40% 


60.18/0.88 


34.80/1.44 


11.13/2.47 


6.95/1.89 



Table 4.15: Simulation results for outliers at p = 5, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


0.44/5.05 


0.38/5.53 


43.61/0.13 


0.27/7.09 


10% 


0.61/4.00 


0.53/4.37 




0.25/7.11 


20% 


1.31/2.50 


1.16/2.71 




0.22/6.88 


30% 


3.37/2.10 


2.61/1.97 




0.26/6.31 


35% 


7.21/2.92 


4.47/2.35 




0.27/6.06 


40% 


18.99/3.59 


8.92/3.50 




0.27/6.06 
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xfg outliers at p = 5, to X12.07 outliers at p — 10, and to Xio outliers at p = 20. 

We present the simulation results in the same manner as before, starting 
with p = 5 in Tables 4.14 and 4.15. Our method usually has slightly higher 
outlier error rates than Kurtosisl*, but inlier error rates that are several percent 
lower. 

The results for p = 10 are presented in Tables 4.16 and 4.17. number of 
false positives. Knowledge and usage of the correct parameter combinations 
for our method can be important — at the higher contamination levels, the 
combination of No Winsorization and MINWT performs considerably better 
than Winsorization and No MINWT. 

We present the results for p = 20 in Tables 4.18 and 4.19. For the three 
levels of 5^ examined so far, use of the MINWT option leads to considerably 
lower error rates. For low contamination levels, the best combination seems to 
be Winsorization, MINWT and a cutoff of 0.8. For high contamination levels, it 
appears that the use of No Winsorization, MINWT and a cutoff of 0.8 usually 
produce the lower error rates. 

4.3.4 A Mixture of x^'s with 5^ = 112.5 

A separation of 5^ = 112.5 leads to xio outliers at p = 5, to X15.61 outliers at 
p = 10, and to X12.5 outhers at p = 20. We graphically illustrate the amount of 
outlier separation for X20 outliers in Figure 4.4 so the reader can compare this 
data situation with X15 and the corresponding error rates. The rejection value 
of 44.14 leads to perfect success among outliers and 0.4% error among inliers. 
The reader will notice that it is in fact possible to obtain perfect success among 
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Table 4.16: Simulation results for X12.07 outliers a,t p — 10, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.73/4.22 


0.61/4.76 


0.43/5.22 


0.40/5.52 


10% 


1.30/3.39 


1.07/3.85 


0.73/4.00 


0.67/4.31 


20% 


4.15/2.76 


3.32/2.86 


1.56/2.56 


1.47/2.77 


30% 


12.43/3.30 


9.72/3.14 


2.59/2.01 


2.53/2.09 


35% 


21.22/3.35 


16.06/3.38 


3.22/2.06 


3.13/2.04 


40% 


46.29/2.06 


28.54/2.96 


4.08/2.56 


3.90/2.36 



Table 4.17: Simulation results for xf2.07 outliers at p = 10, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


0.41/5.39 


0.31/6.30 


62.78/0.14 


0.30/7.79 


10% 


0.61/4.43 


0.51/5.13 




0.23/7.73 


20% 


1.09/3.36 


0.94/3.85 




0.33/7.31 


30% 


1.83/2.79 


1.67/3.02 




0.67/6.60 


35% 


2.52/2.98 


2.21/3.01 




1.03/5.94 


40% 


3.83/3.90 


3.26/3.71 




1.06/5.22 
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Table 4.18: Simulation results for Xio outliers at p = 20, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.88/5.14 


0.65/6.12 


0.25/9.01 


1.84/10.47 


10% 


1.85/4.62 


1.29/5.37 


0.40/7.16 


0.82/7.61 


20% 


6.22/4.68 


4.31/4.96 


0.85/4.95 


1.41/5.22 


30% 


13.49/5.73 


10.61/5.74 


1.93/3.85 


2.11/4.06 


35% 


28.33/3.28 


16.33/5.31 


2.12/3.80 


2.47/3.96 


40% 


53.94/1.21 


42.85/1.56 


2.18/4.02 


2.65/4.17 



Table 4.19: Simulation results for xlo outliers at p = 20, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


0.74/7.60 


0.56/11.75 


70.69/0.16 


0.68/8.13 


10% 


0.81/6.36 


0.68/9.24 


89.38/0.10 


0.54/8.04 


20% 


1.38/5.21 


1.02/6.82 




0.56/7.71 


30% 


1.46/4.58 


1.53/5.73 




1.18/6.78 


35% 


1.83/4.70 


1.87/5.82 




5.49/5.94 


40% 


1.67/5.12 


0.55/8.99 




23.15/4.12 
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both outliers and inliers if, after visually examining a plot of the Mahalanobis 
distances, the rejection line is visually determined by the researcher. If possi- 
ble, therefore, it is always preferable for the researcher to visually examine the 
Mahalanobis distances and ensure that the rejection line has been well chosen 
by the algorithm, but sometimes this may not be possible and an automated 
process will have to be used without human intervention. We do not recommend 
blindly increasing the rejection point by a certain factor (which would lower the 
error rate for this particular data set), because that would likely lead to much 
higher outlier error rates for highly overlapping outliers. 

We present the simulation results as before, starting with p = 5 in Tables 
4.20 and 4.21. The separation level 5^ is now sufficiently large that BACON* 
is able to detect the outliers, and it consequently produces the best overall 
performance. For low contamination levels (20% or below), it seems that the best 
combination is Winsorization and No MINWT with cutoff value 0.85 while for 
more than 20% contamination, the best combination seems to be Winsorization 
and MINWT with the same cutoff value, 0.85. 

The results for p — 10 are presented in Tables 4.22 and 4.23. For the 
lower contamination levels, all of the parameter combinations for our method 
produce zero outlier error. The only differences are to be found among the 
inlier error rates, which can be decreased by a lower cutoff value. These inlier 
error rates are consistently lower than those of Kurtosisl*. For those parameter 
combinations shown in the tables, he best overall parameter combination for 
our method seems to be Winsorization and No MINWT, with a cutoff value of 
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Figure 4.4: Outliers from xlo "with Mahalanobis distances 
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Table 4.20: Simulation results for xlo outliers at p = 5, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.07/3.00 


0.07/3.30 


0.07/3.14 


0.07/3.38 


10% 


0.08/2.40 


0.08/2.61 


0.05/2.47 


0.05/2.59 


20% 


0.11/1.14 


0.01/1.28 


0.11/1.11 


0.07/1.20 


30% 


0.60/0.29 


0.47/0.24 


0.36/0.18 


0.33/0.18 


35% 


3.10/0.85 


1.75/0.45 


0.71/0.28 


0.49/0.19 


40% 


10.87/1.90 


5.52/1.21 


2.89/1.25 


1.00/0.48 



Table 4.21: Simulation results for xlo outliers at p = 5, part 2 



Out% 


No Wins. 
MNWT 
0.8 


No Wins. 
MNWT 
0.85 


BACON* 


Kurtosisl* 


5% 


0.07/3.53 


0.06/3.88 


0.18/0.23 


0.0/7.06 


10% 


0.08/2.94 


0.07/3.25 


0.18/0.22 


0.0/7.00 


20% 


0.07/1.69 


0.06/1.82 


0.19/0.22 


0.0/6.90 


30% 


0.20/0.37 


0.18/0.36 


0.21/0.22 


0.0/6.34 


35% 


1.10/1.17 


0.39/0.58 


0.21/0.22 


0.0/6.06 


40% 


9.10/3.74 


2.67/2.27 


0.23/0.22 


0.0/5.39 
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Table 4.22: Simulation results for X15.61 outliers a,t p — 10, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.0/2.10 


0.0/2.40 


0.0/2.55 


0.0/2.71 


10% 


0.0/1.67 


0.0/1.92 


0.0/1.88 


0.0/2.02 


20% 


0.0/0.87 


0.0/1.00 


0.0/0.88 


0.0/1.00 


30% 


0.21/0.11 


0.10/0.07 


0.15/0.03 


0.15/0.04 


35% 


0.88/0.63 


0.24/0.27 


0.20/0.03 


0.20/0.03 


40% 


4.70/1.94 


1.80/1.08 


0.11/0.11 


0.0/0.03 



Table 4.23: Simulation results for X15.61 outliers at p = 10, part 2 



Out% 


No Wins. 
MNWT 
0.8 


No Wins. 
MNWT 
0.85 


BACON* 


Kurtosisl* 


5% 


0.0/2.67 


0.0/3.03 


0.0/0.26 


0.0/7.58 


10% 


0.0/2.18 


0.0/2.51 


0.0/0.26 


0.0/7.56 


20% 


0.0/1.43 


0.0/1.58 


0.0/0.26 


0.0/7.38 


30% 


0.10/0.12 


0.06/0.12 


0.0/0.26 


0.0/6.80 


35% 


0.05/0.15 


0.10/0.11 


0.0/0.26 


0.0/6.31 


40% 


0.17/1.09 


0.10/0.30 


0.0/0.26 


0.0/5.49 
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0.85. It seems likely, however, that for this relatively large separation level, even 
lower inlier error rates could be obtained with a lower cutoff value, while still 
maintaining perfect or near perfect success among the outliers. 

Results for p — 20 are shown in Tables 4.24 and 4.25. The observations 
and trends discussed above for p = 5 and p — 10 remain valid for p = 20. 
Winsorization and No MINWT with a low cutoff value (possibly lower than 
0.85) still lead to the best overall results. 

4.3.5 A Mixture of x^'s with 5^ = 200 

Finally, we examine a situation where there is very little outlier overlap, 
namely 5'^ — 200. This leads to xlb at p = 5, to Xiq.u outhers at p = 10, and to 
xfg outhers at p = 20. We expect all of the methods to successfully detect all 
of the outliers, but are particularly interested in seeing how quickly the inlier 
error rates decrease to zero. 

We present the results for p = 5 in Tables 4.26 and 4.27. The outlier overlap 
is now sufficiently small for BACON* to be able to detect all of the outliers. 

The results for p = 10 are presented in Tables 4.28 and 4.28, and for p = 20 
in Tables 4.30 and 4.28. For the low outlier overlap examined here, the best 
parameter combination continues to be Winsorization and No MINWT, with 
a low cutoff (lower than 0.85, the lowest value examined in these simulations.) 
BACON* does very well with this of widely separated distributions, with the our 
algorithm's error rates decreasing at a slower rate, as 6^ increases. Kurtosisl*, 
in contrast, has maintained the same inlier error rate (around 6% - 8%) as for 
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Table 4.24: Simulation results for X12.5 outliers a,t p — 20, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.0/1.49 


0.0/1.68 


0.0/2.19 


0.0/2.56 


10% 


0.0/1.15 


0.0/1.33 


0.0/1.56 


0.0/1.73 


20% 


0.0/0.64 


0.0/0.76 


0.0/0.71 


0.0/0.83 


30% 


0.40/0.06 


0.0/0.02 


0.0/0.0 


0.0/0.0 


35% 


0.54/0.42 


0.45/0.19 


0.0/0.0 


0.0/0.0 


40% 


1.87/1.26 


0.81/0.70 


0.0/0.0 


0.0/0.0 



Table 4.25: Simulation results for X12.5 outliers at p — 20, part 2 



Out% 


No Wins. 
MNWT 
0.8 


No Wins. 
MNWT 
0.85 


BACON* 


Kurtosisl* 


5% 


0.0/2.00 


0.0/2.98 


0.0/0.26 


0.0/8.23 


10% 


0.0/1.71 


0.0/2.16 


0.0/0.26 


0.0/8.04 


20% 


0.0/1.14 


0.0/1.31 


0.0/0.26 


0.0/7.90 


30% 


0.0/0.04 


0.0/0.04 


0.0/0.26 


0.0/7.28 


35% 


0.0/0.04 


0.0/0.04 


0.0/0.26 


0.0/6.60 


40% 


0.0/0.06 


0.0/0.05 


0.0/0.26 


0.0/5.26 
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Table 4.26: Simulation results for X25 outliers at p = 5, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.0/2.03 


0.0/2.32 


0.0/2.25 


0.0/2.43 


10% 


0.0/1.66 


0.0/1.89 


0.0/1.73 


0.0/1.89 


20% 


0.05/0.92 


0.05/1.05 


0.05/0.91 


0.06/1.00 


30% 


0.10/0.04 


0.10/0.04 


0.10/0.03 


0.10/0.04 


35% 


0.31/0.10 


0.16/0.04 


0.12/0.03 


0.11/0.02 


40% 


5.87/1.10 


1.39/0.41 


0.94/0.61 


0.12/0.12 



Table 4.27: Simulation results for X25 outliers at p = 5, part 2 



Out% 


No Wins. 
MNWT 
0.8 


No Wins. 
MNWT 
0.85 


BACON* 


Kurtosisl* 


5% 


0.0/2.59 


0.0/2.89 


0.0/0.23 


0.0/7.23 


10% 


0.0/2.19 


0.0/2.35 


0.0/0.23 


0.0/7.16 


20% 


0.05/1.36 


0.04/1.48 


0.0/0.23 


0.0/6.93 


30% 


0.05/0.09 


0.05/0.10 


0.0/0.23 


0.0/6.56 


35% 


0.11/0.20 


0.05/0.09 


0.0/0.22 


0.0/6.14 


40% 


7.46/3.28 


1.25/1.73 


0.0/0.23 


0.0/5.49 



108 



Table 4.28: Simulation results for X25 outliers at p = 10, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.0/1.39 


0.0/1.61 


0.0/1.73 


0.0/1.88 


10% 


0.0/1.20 


0.0/1.34 


0.0/1.35 


0.0/1.45 


20% 


0.0/0.69 


0.0/0.79 


0.0/0.69 


0.0/0.79 


30% 


0.0/0.01 


0.0/0.01 


0.0/0.0 


0.0/0.01 


35% 


0.10/0.05 


0.10/0.01 


0.10/0.0 


0.10/0.0 


40% 


1.95/0.95 


0.34/0.30 


0.10/0.02 


0.0/0.0 



Table 4.29: Simulation results for X25 outliers at p = 10, part 2 



Out% 


No Wins. 
MNWT 
0.8 


No Wins. 
MNWT 
0.85 


BACON* 


Kurtosisl* 


5% 


0.0/1.84 


0.0/2.06 


0.0/0.26 


0.0/7.72 


10% 


0.0/1.57 


0.0/1.77 


0.0/0.26 


0.0/7.57 


20% 


0.0/1.07 


0.0/1.18 


0.0/0.26 


0.0/7.27 


30% 


0.0/0.03 


0.0/0.04 


0.0/0.26 


0.0/6.87 


35% 


0.0/0.02 


0.0/0.02 


0.0/0.26 


0.0/6.10 


40% 


0.08/0.93 


0.0/0.13 


0.0/0.26 


0.0/5.55 
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Table 4.30: Simulation results for X25 outliers at p = 20, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.0/0.91 


0.0/1.03 


0.0/1.35 


0.0/1.61 


10% 


0.0/0.76 


0.0/0.86 


0.0/0.98 


0.0/1.13 


20% 


0.0/0.47 


0.0/0.56 


0.0/0.52 


0.0/0.61 


30% 


0.0/0.0 


0.0/0.0 


0.0/0.0 


0.0/0.0 


35% 


0.40/0.03 


0.0/0.0 


0.0/0.0 


0.0/0.0 


40% 


0.99/0.49 


0.52/0.18 


0.0/0.0 


0.0/0.0 



Table 4.31: Simulation results for X25 outliers at p = 20, part 2 



Out% 


No Wins. 
MNWT 
0.8 


No Wins. 
MNWT 
0.85 


BACON* 


Kurtosisl* 


5% 


0.0/1.27 


0.0/1.87 


0.0/0.26 


0.0/8.21 


10% 


0.0/1.10 


0.0/1.39 


0.0/0.27 


0.0/7.93 


20% 


0.0/0.80 


0.0/0.91 


0.0/0.26 


0.0/7.78 


30% 


0.0/0.0 


0.0/0.0 


0.0/0.26 


0.0/7.20 


35% 


0.0/0.0 


0.0/0.0 


0.0/0.26 


0.0/6.62 


40% 


0.0/0.03 


0.0/0.0 


0.0/0.25 


0.0/5.77 
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the past few outlier separation levels. 

4.4 Symmetric Data: A Mixture of Gaussians 

We also conducted a limited investigation into outliers drawn from a Normal 

distribution, to verify that our method also performs well in "standard" data 
situations. We investigate two types of Normal outliers. 

4.4.1 An Investigation of Shift Outliers 

In this section we investigate the performance of our method for Normally 

distributed shift outliers, but with less separation than the preliminary simu- 
lations previously discussed. Specifically, at p = 5, we generate inliers from 
Np{0,I) and outliers from Np{3 ■ 1, J), and with outliers from Np{2 ■ 1, J) in 
a second simulation. These two outlier situations produce separation levels of 
5^ — 45 and = 20, respectively. We initially examined A^p(4 • 1, J) as an out- 
lier distribution at p = 5, but for virtually all of the algorithms considered, this 
did not present a credible challenge. In Figure 4.5 we plot the first dimension 
of a 10, 000 X 5 data set with 20% outliers taken from A^p(3 • 1, /). Also shown 
are the corresponding Mahalanobis distances and rejection value of 27.70, for 
this specific data set, which leads to 9/2000 = 0.45% error among outhers and 
among inliers. 

We first examine the simulation results for p = 5 in Tables 4.32 and 4.33. 
BACON* fails to detect the outliers at all contamination levels, while Kurto- 
sisl* provides the most successful overall performance. Our method produces 
very good results up to about 30% or 35% contamination for all parameter com- 
binations, then experiences a very dramatic decrease in performance — in one 
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Figure 4.5: Outliers from Np{3 ■ 1,1) with Mahalanobis distances 
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Table 4.32: Simulation results for Np{3 -1,1) outliers at p = 5, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.01/1.04 


0.08/1.28 


0.08/1.15 


0.07/1.32 


10% 


0.17/0.85 


0.15/1.03 


0.14/0.86 


0.13/1.01 


20% 


0.86/0.59 


0.85/0.72 


0.39/0.46 


0.48/0.56 


30% 


11.43/2.17 


7.13/1.45 


1.82/0.60 


1.32/.38 


35% 








4.09/1.56 


40% 











Table 4.33: Simulation results for Np{3 ■ 1, /) outliers at p = 5, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


0.07/1.33 


0.06/1.61 




0.04/0.52 


10% 


0.11/1.11 


0.09/1.35 




0.05/0.50 


20% 


0.23/0.81 


0.27/1.00 




0.03/0.47 


30% 


2.46/1.88 


1.03/0.99 




0.04/0.41 


35% 




12.20/5.12 




0.04/0.40 


40% 








0.05/0.36 
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instance the outlier error rate jumped from 1.8% to above 90%. This sudden 
change indicates that the sample breakdown point has been reached, for this 
particular data situation. We also observe that a higher weight cutoff is more 
effective at keeping outliers out of the initial subset, resulting in better perfor- 
mances at the 30% and 35% contamination levels, while a lower cutoff provides 
the best results for less than 20% contamination by minimizing the inlier error 
rate and not unduly influencing the outlier error rate. 
For 52 = 45, the mean of the outliers, Ho^ti 

needs to be (2.12, 2.12)^ 
at p = 10. These results can be seen in Tables 4.34 and 4.35. At low contam- 
ination levels (below 20%), our method demonstrates good performance, not 
very different from Kurtosisl*, or sometimes with slightly higher error rates. 
At higher contamination levels and the use of MINWT with a cutoff of 0.85, 
it demonstrates excellent success at detecting outliers, significantly better than 
Kurtosisl*. This is the same pattern observed with the simulations. 

A separation of 5^ = 45 means that at p = 20, we need to have — 
(1.5, . . . , 1.5). We present these results in Tables 4.36 and 4.37. Again, the use 
of MINWT greatly improves the success of our algorithm, as previously observed 
when there is a high outlier overlap. The higher cutoff value of 0.85 limits the 
influence of outhers on the algorithm's location and scale estimates, helping 
to make them more robust, and leading to increased success at identifying the 
outliers. 

Although 5"^ — 45 presents a fairly challenging situation that leads to con- 
siderable overlap with the inliers, we decided to test the outlier algorithms even 
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Table 4.34: Simulation results for A^p(2.12 • 1, J) outliers a,t p — 10, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.59/1.21 


0.45/1.51 


0.33/1.70 


0.32/1.94 


10% 


1.28/1.14 


0.97/1.39 


0.57/1.32 


0.58/1.56 


20% 


8.19/1.85 


5.42/1.85 


1.25/1.01 


1.54/1.23 


30% 




64.63/0.65 


3.49/1.43 


3.32/1.32 


35% 






67.35/0.32 


6.29/2.52 


40% 











Table 4.35: Simulation results for Np{2.12 ■ 1, 1) outliers at p = 10, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


0.35/1.61 


0.31/2.07 




0.96/1.07 


10% 


0.53/1.43 


0.49/1.89 




0.93/1.09 


20% 


0.98/1.37 


1.10/1.80 




1.05/0.94 


30% 


3.43/2.42 


2.52/2.28 




11.02/0.97 


35% 


75.87/0.23 


17.31/3.36 




59.48/0.67 


40% 




88.26/0.64 







115 



Table 4.36: Simulation results for A^p(1.5 • 1, /) outliers ect p — 20, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


3.28/1.85 


2.21/2.33 


1.44/5.37 


10.23/11.41 


10% 


8.11/2.27 


5.05/2.61 


1.99/3.83 


4.79/9.95 


20% 


44.60/2.40 


23.74/3.72 


3.37/3.11 


5.06/5.52 


30% 




82.73/1.26 


8.25/3.78 


14.20/3.90 


35% 






68.02/0.29 


52.31/0.91 


40% 






87.70/1.41 


79.14/1.33 



Table 4.37: Simulation results for Np{1.5 ■ 1, 1) outliers at p = 20, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


1.71/3.18 


4.96/12.52 




2.32/1.14 


10% 


2.23/2.89 


4.32/9.72 




3.14/1.12 


20% 


3.36/3.08 


5.17/7.27 






30% 


9.18/4.00 


20.74/3.96 






35% 


69.62/0.34 


55.32/1.45 






40% 


86.62/1.77 


77.24/3.43 
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more by generating outliers from Np{2-1, 1) at p — 5, which imphes a separation 
level of 5'^ — 20. The first dimension of a 10, 000 x 5 data set with this outlier 
separation and 20% contamination is shown in Figure 4.6, with corresponding 
Mahalanobis distances. The rejection value for this data set is 24.84, leading 
to 286/2000 = 14.3% error among outhers and 271/8000 = 3.39% error among 
inliers. 

The numerical results for p = 5 are presented in Tables 4.38 and 4.39. We 
immediately notice that all of the error rates are considerably higher than with 
the Np{3 ■ 1,1) outliers. As before, BACON* fails to detect the outliers at all 
contamination levels. We also observe that relative to our method, Kurtosisl* 
has a hard time identifying the outliers. It fails to identify half of them at e = 
10%, and more than 90% at e = 20%. In the simulations, the corresponding 
high-overlap situation was provided by drawing outliers from Xio (Tables 4.2 
and 4.3 on p. 88). For both the Xio Normal outliers presented here, the 
best performance is given by our algorithm with the usage of No Winsorization 
and MINWT. 

The results for p — 10 are presented in Tables 4.40 and 4.41. To maintain 
5^ — 20, we need to draw outliers from A'p(1.41 ■!,/). Compared to p = 5, there 
is a marked increase in error rates for all of the methods. Usage of the MINWT 
option continues to greatly enhance the performance of our method. 

For p = 20 and 6'^ = 20, we need to set /Xq^j = 1, a coordinate-wise sep- 
aration of only one standard deviation. We present the results in Tables 4.42 
and 4.43. Already at e = 5%, Kurtosisl* fails to identify more than 80% of the 
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Figure 4.6: Outliers from Np{2 ■ 1, 1) with Mahalanobis distances 
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Table 4.38: Simulation results for Np{2 • 1, /) outliers at p = 5, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


13.17/1.82 


11.82/2.16 


11.49/1.99 


10.69/2.29 


10% 


17.45/1.76 


15.60/2.10 


13.90/1.80 


13.45/2.06 


20% 


37.13/1.70 


31.52/2.09 


21.06/1.69 


20.91/1.85 


30% 


88.23/0.44 


79.66/0.66 


66.49/0.88 


45.81/1.51 


35% 








87.19/0.29 


40% 











Table 4.39: Simulation results for Np{2 ■ 1, /) outliers at p = 5, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


10.62/2.24 


9.46/2.77 




16.90/0.48 


10% 


12.57/2.13 


11.56/2.64 




53.94/0.30 


20% 


18.65/2.37 


17.28/2.74 






30% 


67.86/1.29 


41.91/2.70 






35% 




83.82/0.99 






40% 
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Table 4.40: Simulation results for A^p(1.41 • 1, /) outliers a,t p — 10, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


25.21/2.26 


21.43/2.87 


17.80/3.17 


17.12/3.65 


10% 


34.62/2.30 


28.64/2.96 


20.99/3.07 


20.58/3.51 


20% 


67.53/1.66 


54.53/2.55 


31.18/3.04 


29.95/3.37 


30% 




85.08/1.52 


73.54/1.32 


62.53/1.61 


35% 






88.87/1.35 


82.61/1.24 


40% 











Table 4.41: Simulation results for A^p(1.41 • 1, 1) outliers at p = 10, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


18.79/2.88 


16.62/3.83 




48.78/0.94 


10% 


21.40/2.95 


19.56/3.84 






20% 


30.15/3.35 


27.09/4.21 






30% 


72.92/1.59 


61.63/2.07 






35% 


88.29/1.62 


80.50 






40% 




89.78/2.80 
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Table 4.42: Simulation results for Np{l,I) outliers at p = 20, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


41.72/2.96 


34.58/4.03 


27.26/8.46 


32.20/15.45 


10% 


54.81/2.94 


44.85/4.03 


31.67/7.01 


37.89/9.19 


20% 


80.85/2.20 


71.02/3.09 


49.99/4.27 


56.25/6.13 


30% 




86.70/2.88 


76.75/3.51 


73.76/5.37 


35% 






84.32/4.73 


79.90/7.10 


40% 






88.16/6.68 


85.13/8.17 



Table 4.43: Simulation results for Np{l, I) outliers at p = 20, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


29.28/5.55 


42.91/15.12 




82.24/0.95 


10% 


32.70/5.35 


38.23/8.84 






20% 


48.53/3.93 


56.67/7.96 






30% 


76.41/2.97 


71.18/7.86 






35% 


84.86/3.74 


79.68/6.59 






40% 


89.63/4.83 


83.88/8.80 
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outliers. The best performance by our method at this level of contamination 
is an outlier error rate of 27.3%, which is of course a vast improvement over 
Kurtosisl* and BACON*. This error rate is provided by the combination of 
Winsorization and MINWT with a cutoff value of 0.8, which also provides the 
best overall results for e < 20%. This is a very difficult situation for all of the 
algorithms, we feel that our algorithm still manages to preserve some measure 
of credibility, at least over the lower contamination levels. 

We thus conclude our investigation of Normally distributed shift outhers 
with a very small separation level, 5^. Our method has proven to be one of the 
frontline contenders, by either producing the best performance, or coming very 
close to the best performance. It is very important, however, to be aware of the 
general characteristics of the data set to be examined, and to accordingly select 
the correct parameter combinations. The wrong parameters can lead to vastly 
suboptimal results. For instance, when dealing with small separation levels 
such as in this particular section, the usage of the MINWT always produces 
a dramatic increase in performance. A higher cutoff value such as 0.85 will 
result in better outlier identification, but this has to be balanced against an 
increased number of false positives. The general principles for optimal parameter 
combinations discovered for these shift outliers are the same as those for heavily 
overlapping outliers. A httle knowledge of the data set to be examined can 
go a long way in yielding outstanding outlier identification success when these 
general principles are understood and applied. 
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4.4.2 A Further Investigation of Shift OutUers 

Juan and Prieto [44] present a simulation study comparing several outlier 
algorithms, including MCD and MULTOUT, but not Kurtosisl or BACON. 
All of the methods examined experienced complete or near complete failure for 
higher than 10% or 15% contamination. Both the MCD and MULTOUT failed 
at p = 20 with 10% outhers, at p = 10 with 15% outliers, and of course in all 
of the more difficult situations. Accordingly, we thought this would be a good 
test for our method. The data sets are constructed as follows. For p — 5, 10, 20, 
generate inliers from iVp(0, /), and outliers according to Np{kei, A^I), where ei 
denotes the ffist unit vector (1, 0, ... , 0)^, A = 0.1 and k is described in Rocke 
and Woodruff [66] as a suitable distance from the center of the outhers to that 
of the inliers. Since the off-diagonal covariances are zero, ^ Xp,o.95 would be 
the radius of a sphere containing 95% of the inliers, because the Mahalanobis 
distance follows a xl distribution for Normally distributed data. If the outher 
center is situated at twice this distance, the outlier and inlier spheres should 
not have much overlap. Rocke and Woodruff [66] therefore set A; = 2 Xp,o.95 
close outliers and k — 4y^Xp^o.95 distant outliers. The algorithms examined by 
Juan and Prieto [44] failed completely to detect both close and distant outhers 
for greater than 10%-15% contamination. Note that besides this difference 
in separation distance k, the outliers here also differ from those in the previous 
section in that only the first component is displaced instead of all p components. 
These outliers are situated further away, but in only one component. The first 
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component of these outliers also has a smaller variance. Juan and Prieto [44] 
consider outlier contamination levels of 5%, 10%, 15% and 20%, and n — lOp. 
With the exception of 15%, we employ the same contamination levels, but since 
we are more interested in large data sets, set n — 10, 000 as before. 

Prom previous results, we have sufficient knowledge to predict which pa- 
rameter combinations will enable our method to produce the least possible error 
rate. Accordingly, it will not be necessary to examine the other parameter 
combinations, and we can focus our attention on the best combination. Con- 
tamination levels are relatively low, and the amount of outlier overlap is not 
very high, so the combination of Winsorization and No MINWT should demon- 
strate the best overall success. We expect the outlier error rate to be very low 
for most cutoff values (probably less than 1%), so we can focus on minimizing 
the inlier error rate. Previously, we examined cutoff values of 0.85 and 0.9 for 
this parameter combination; 0.85 yielded the lowest inlier error rate. Thus, we 
decided to investigate only the combination of Winsorization and No MINWT, 
and also to examine 2 additional cutoff values: 0.7 and 0.75. We include Kurto- 
sisl* in this simulation study, but not BACON*, since it showed almost perfect 
results, which should not be too surprising, since BACON* usually does well in 
Normally distributed data, especially when there is not much outlier overlap. 
During the execution of these simulations, we discovered that all of the methods 
demonstrated perfect success amongst the outliers, so there is no need to present 
these error rates. 
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The inlier error rates are presented in Table 4.44, based on 1,000 iterations 
for our method and 10 for Kurtosisl*. The lower cutoff values produce smaller 
inlier errors than the higher cutoff values, i.e. cutoff =0.7 always led to our 
algorithm's best performance; this is the same trend previously observed. For 
more difficult data situations, however, the lower cutoff values will lead to higher 
outlier error rates, which might not be desirable. For close outliers in p = 5 di- 
mensions, Kurtosisl* outperforms our method; for distant outliers in the same 
dimension, performances are approximately equal, and within the same confi- 
dence interval, as is the case for close outliers in p = 10 dimensions. Our method 
starts demonstrating superior results for distant outliers in p = 10 dimensions, 
and produces markedly better success at both close and distant outliers in p = 20 
dimensions. It is hard to draw conclusions for the optimal parameter combina- 
tions based on this generally unchallenging simulation experiment, but at least 
we were able to demonstrate the superiority of our method to those methods 
examined by Juan and Prieto [44] such as the MCD and MULTOUT. 

4.5 Correlated Data 

The data examined up to this point have been uncorrelated. Although 
this is ideal, there is no guarantee that actual data sets have zero correlation. 
Sometimes correlated data can influence the statistical analysis in an undesirable 
manner. We therefore tested and compared the performance of our method on 
two types of correlated data — a mixture of correlated multivariate normals. 
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Table 4.44: Inlier error rates for shift outliers as described in [44] 



WUX/O 


P 


H 


CtJ 


CtJ 


CtJ 


CtJ 


ivurxosisi 








n 7 


U. ( o 


U.oO 


n Q 




5% 


5 


6.65 


1.37 


1.43 


1.73 


2.18 


0.52 






13.31 


0.59 


0.61 


0.71 


0.82 


0.52 




10 


8.56 


1.06 


1.14 


1.44 


1.82 


1.08 






17.11 


0.36 


0.38 


0.46 


0.56 


1.10 




20 


11.21 


0.87 


0.92 


1.20 


1.58 


1.17 






22.42 


0.26 


0.27 


0.33 


0.41 


1.17 


10% 


5 


6.65 


1.35 


1.41 


1.69 


2.12 


0.50 






13.31 


0.57 


0.69 


0.67 


0.79 


0.50 




10 


8.56 


1.11 


1.17 


1.48 


1.91 


1.07 






17.11 


0.37 


0.39 


0.47 


0.59 


1.07 




20 


11.21 


0.95 


1.00 


1.31 


1.73 


1.16 






22.42 


0.28 


0.30 


0.35 


0.45 


l.lb 


20% 


5 


6.65 


1.35 


1.42 


1.76 


2.28 


0.48 






13.31 


0.53 


0.54 


0.63 


0.78 


0.48 




10 


8.56 


1.31 


1.39 


1.73 


2.28 


0.98 






17.11 


0.43 


0.44 


0.53 


0.68 


0.98 




20 


11.21 


1.24 


1.30 


1.69 


2.27 


1.11 






22.42 


0.35 


0.36 


0.43 


0.51 


1.11 
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and a mixture of correlated Poissons. 

4.5.1 A Mixture of Correlated Normals 

An important decision for generating correlated random data is the choice 
of covariance matrix S. If the off-diagonal elements are randomly generated 
(for instance, using a uniform distribution), there is a distinct possibility that 
the matrix will not be positive definite, a very critical property of all covariance 
matrices. Although it is possible to convert a given matrix to another matrix 
which is positive definite, we preferred to have more control over the exact 
structure of the covariance matrix. Accordingly, we selected a S that has I's on 
the diagonal and 0.5 elsewhere: 



1 


0.5 


0.5 


0.5 


0.5 


0.5 


1 


0.5 


0.5 


0.5 


0.5 


0.5 


1 


0.5 


0.5 


0.5 


0.5 


0.5 


1 


0.5 


0.5 


0.5 


0.5 


0.5 


1 



for the case p — h and similarly for p = 10, 20. We henceforth denote these 
respective matrices as C5, Cio and C2o- One of the reasons for utilizing a 
multivariate normal distribution, is so that we can directly compare results with 
the uncorrelated shift outhers examined in section 4.4.1. It would therefore be 
good to employ a similar separation value 5^. Recalling that 5^ — 20 provided a 
very challenging situation for all of the methods, we therefore selected |x^^ such 
that 52 ^ 20 at p = 5, 10, 20. At p = 5, this implies that = (3.46, . . . , 3.46), 
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at p = 10 we need to set n^^^ — (3.317, . . . , 3.317), at for p = 20 we need to have 
= (3.24,..., 3.24). 

We start by presenting the results for p = 5 in Tables 4.45 and 4.46. BA- 
CON* is not able to identify the outliers. Our algorithm demonstrates markedly 
better outlier identification properties than Kurtosisl* for all parameter com- 
binations. Despite the fact that BACON* and Kurtosisl* experience difficulty 
detecting the outliers in these data sets, our algorithm typically produces error 
rates of 2%-3% for e = 5%, 10%. 

The results for p = 10 are shown in Tables 4.47 and 4.48. At the lower 
contamination levels, Kurtosisl* produces higher error rates than at p = 5; 
the outlier error rates for our algorithm drop slightly (roughly 0.5%) with a 
corresponding slight increase of the inlier error rates. This is the same pattern 
observed with constant 5^ throughout the uncorrelated simulations. 

This trend continues for p = 20, shown in Tables 4.49 and 4.50. BACON* 
is still not able to identify the outliers, while Kurtosisl* has an error rate of 
over 80% at e = 5% and over 90% otherwise. In contrast, for certain parameter 
combinations at e = 5%, our method has an outlier error rate of less than 1% 
— it identifies almost all of the outliers, for an inlier error rate of approximately 
5% (or for an outher error rate of 1.5%, the inlier error rate drops to 3.5% 
when using a lower cutoff value). Our algorithm appears to perform well on 
correlated data, demonstrating low absolute error rates, as well as significantly 
better relative performance compared to the other algorithms examined here. 
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Table 4.45: Simulation results for A^p(3.46 • 1, C5) outliers at p = 5, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


2.14/2.24 


1.55/2.95 


2.33/ 2.18 


1.78/2.68 


10% 


2.87/1.80 


2.23/2.34 


3.12/1.76 


2.58/2.14 


20% 


5.82/1.25 


5.44/1.65 


6.21/1.21 


6.21/1.52 


30% 


21.98/1.16 


16.54/1.49 


15.11/0.99 


12.75/1.21 


35% 


89.55/0.09 


61.69/0.63 




39.64/0.90 


40% 











Table 4.46: Simulation results for A^p(3.46 • 1, C5) outliers at p = 5, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


2.77/2.08 


1.92/2.72 




16.32/0.47 


10% 


3.86/1.72 


2.76/2.28 




48.09/0.31 


20% 


8.35/1.45 


7.09/1.77 






30% 


37.93/1.98 


21.51/1.99 






35% 




86.82/0.33 






40% 
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Table 4.47: Simulation results for iVp(3.317 • 1, Cio) outliers at p — 10, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


2.27/2.21 


1.45/3.24 


2.03/2.56 


1.34/3.57 


10% 


2.99/1.85 


2.05/2.70 


2.88/2.09 


2.03/2.92 


20% 


6.15/1.34 


5.35/1.89 


6.47/1.58 


6.02/2.15 


30% 


26.86/1.21 


18.01/1.67 


14.64/1.15 


13.79/1.65 


35% 




64.25/0.61 


82.23/0.18 


30.92/1.28 


40% 











Table 4.48: Simulation results for A^p(3.317 • 1, Cio) outliers at p = 10, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


2.30/2.47 


1.45/3.55 




48.36/0.93 


10% 


3.51/2.03 


2.25/2.91 






20% 


9.62/1.73 


7.14/2.34 






30% 


38.27/1.82 


23.41/2.27 






35% 




77.86/0.50 






40% 
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Table 4.49: Simulation results for A^p(3.24 • 1, C20) outliers ai p — 20, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


2.65/2.10 


1.44/3.40 


1.48/3.53 


0.88/5.16 


10% 


3.55/1.75 


2.08/2.89 


2.23/2.94 


1.41/4.32 


20% 


6.98/1.36 


5.41/2.08 


6.51/2.13 


5.38/3.14 


30% 


34.43/0.95 


19.81/1.74 


15.44/1.45 


15.35/2.37 


35% 




71.26/0.40 


69.50/0.31 


29.17/1.79 


40% 






82.70/12.96 





Table 4.50: Simulation results for A^p(3.24 • 1, C20) outliers at p = 20, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


1.57/3.52 


0.91/5.27 




81.98/0.98 


10% 


2.63/2.81 


1.53/4.32 






20% 


9.74/2.15 


6.31/3.29 






30% 


40.88/1.64 


25.49/2.75 






35% 


88.88/0.63 


69.84/0.73 






40% 
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4.5.2 A Mixture of Correlated Poissons 

In chapter 3 (p. 51) we examined the effect of correlated Poisson data on the 
covariance estimate. It would be worthwhile to examine the overall performance 
of our algorithm on this type of data. We again use the method of generating 
correlated Poisson data detailed in [42, ch. 37] and briefly described on p. 51. 
For the inliers, we generate inliers from Poisson(lO) and use a covariance matrix 
with 20's on the diagonal and lO's elsewhere. In p = 5 dimensions, this matrix 
is 



20 


10 


10 


10 


10 


10 


20 


10 


10 


10 


10 


10 


20 


10 


10 


10 


10 


10 


20 


10 


10 


10 


10 


10 


20 



and similarly for the other dimensions. In chapter 3 we generated outliers from 
Poisson(40), but this hardly presents a challenge for most methods as they have 
close to perfect success. In terms of the separation level S"^, it would be ideal to 
select a separation level comparable to other distributions previously examined, 
so that the speciflc effect of Poisson distributed data can be examined. We 
observed with correlated multivariate normal data that 6^ = 20 provided a good 
challenge, so we decided to use a similar separation value for the Poisson data. 
Setting 5^ = 18.75 allows us to use an integer mean to generate outliers - from 
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Table 4.51: Simulation results for Poisson(25) outliers at p = 5, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


2.25/2.09 


1.72/2.71 


2.12/2.32 


1.85/2.48 


10% 


2.66/1.78 


2.15/2.23 


2.70/1.75 


2.27/2.06 


20% 


6.03/0.88 


5.70/1.07 


5.39/0.89 


5.61/0.99 


30% 


14.02/0.60 


16.40/0.82 


9.63/0.59 


10.30/0.70 


35% 


39.94/0.32 


31.74/0.48 


15.68/0.53 


14.23/0.55 


40% 


83.47/0.04 


78.84/0.07 


48.44/0.32 


30.53/0.47 



Table 4.52: Simulation results for Poisson(25) outliers at p = 5, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


2.15/2.37 


1.88/2.50 




8.14/0.95 


10% 


2.96/1.75 


2.28/2.13 




10.80/0.84 


20% 


6.93/1.06 


6.00/1.13 




50.72/0.27 


30% 


12.00/1.06 


10.59/1.00 




77.49/0.08 


35% 


29.35/1.26 


19.69/1.30 




82.90/0.03 


40% 


69.86/0.29 


41.54/0.87 




87.20/0.02 
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Poisson(25) at p = 5, and from Poisson(24) a,t p — 20. 

The results for p = 5 are shown Tables 4.51 and 4.51. We notice that the 
error rates for our method are approximately the same as for the correlated 
multivariate normal data, this is consistent with a similar separation level S^. 
Kurtosisl* demonstrates much better results than with the multivariate nor- 
mal data, although it is still significantly worse than our proposed algorithm. 
BACON* is not able to identify the outliers in these data sets. 

To maintain 5'^ — 18.75 at p = 10, we need to generate outhers from Pois- 
son(24.36). These results are shown in Tables 4.53 and 4.54. We notice that 
our algorithm as well as Kurtosisl*, the outlier error rates are somewhat lower 
than at p = 5, while the inlier error rates are slightly higher. For nearly all 
parameter combinations, our method correctly identifies more than 99% of the 
outhers at e = 5% contamination. With the use of the MINWT option, the 
outlier error rate of our method is almost always below 10% for contamination 
levels e < 35%. 

Finally, in Tables 4.55 and 4.56, we examine the results for p = 20, where 
the outliers are generated from Poisson(24). Outlier error rates are lower for 
all of the methods, and BACON* 's outlier error rate drops below 90% for the 
first time in this examination of correlated data. Especially with the use of the 
MINWT option, our algorithm demonstrates excellent outher identification suc- 
cess, regarding both low absolute error rates and relative performance compared 
to other algorithms. 
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Table 4.53: Simulation results for Poisson(24.36) outliers at p = 10, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


1.09/1.98 


0.72/2.83 


0.80/2.80 


0.62/3.13 


10% 


1.28/1.73 


0.90/2.39 


1.04/1.99 


0.78/2.64 


20% 


3.58/0.84 


3.21/1.07 


3.07/0.89 


3.20/1.12 


30% 


8.59/0.65 


10.82/1.04 


6.17/0.71 


6.71/0.94 


35% 


18.74/0.67 


16.10/0.85 


7.29/0.70 


7.19/0.54 


40% 


56.38/0.34 


54.37/0.34 


13.65/0.92 


11.23/0.89 



Table 4.54: Simulation results for Poisson(24.36) outliers at p = 10, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


0.81/2.75 


0.63/3.07 




7.48/1.35 


10% 


1.13/1.89 


0.75/2.67 




9.61/1.25 


20% 


3.97/1.05 


3.15/1.20 




18.05/0.96 


30% 


7.04/1.11 


6.20/1.12 




55.55/0.28 


35% 


11.04/1.65 


9.85/1.73 




66.68/0.18 


40% 


22.59/1.62 


13.35/1.92 




74.85/0.06 
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Table 4.55: Simulation results for Poisson(24) outliers at p = 20, part 1 



Out% 


Wins. 


Wins. 


Wins. 


Wins. 




No MNWT 


No MNWT 


MNWT 


MNWT 




0.85 


0.9 


0.8 


0.85 


5% 


0.37/1.75 


0.20/2.80 


0.16/3.82 


0.11/4.46 


10% 


0.46/1.58 


0.27/2.44 


0.23/2.60 


0.16/3.84 


20% 


1.61/0.72 


1.29/1.04 


1.26/0.91 


1.26/1.49 


30% 


4.31/0.53 


5.41/1.01 


3.45/0.78 


3.84/1.24 


35% 


8.39/0.77 


7.38/0.86 


3.38/0.51 


3.95/0.80 


40% 


13.09/1.24 


12.98/1.51 


4.50/0.76 


5.72/1.32 



Table 4.56: Simulation results for Poisson(24) outliers at p = 20, part 2 



Out% 


No Wins. 


No Wins. 


BACON* 


Kurtosisl* 




MNWT 


MNWT 








0.8 


0.85 






5% 


0.16/3.80 


0.11/4.40 


76.66/0 


3.64/1.32 


10% 


0.23/2.52 


0.15/3.85 


88.91/0 


4.74/1.28 


20% 


1.37/1.01 


1.02/1.48 




5.47/1.16 


30% 


3.09/1.02 


2.72/1.17 




9.92/0.87 


35% 


3.96/1.36 


3.37/1.20 




16.20/0.65 


40% 


4.89/1.78 


4.55/1.69 




32.13/0.28 
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4.6 Examination of a Real Data Set 

Simulated data sets are good because they can be tailored to measure cer- 
tain specific abilities of an algorithm, such as performance on data containing 
a specified amount of skewness or a specified amount of correlation, or with 
outliers at a specified separation level from the inliers. Simulated data sets are 
somewhat artificial, though, in the sense that they can rarely measure all of 
the characteristics contained in real data sets. Not infrequently, real data sets 
present additional problems such as scaling or missing data, which are not found 
in simulated data. We therefore tested our method on such a data set, which 
we will simply call the Wire data set. The Wire data consists of measurements 
of p = 6 trace elements on n — 47, 443 pieces of wire. These six elements, with 
their chemical symbols, are tin (Sn), nickel (Ni), copper (Cu), silver (Ag), anti- 
mony (Sb), and zinc (Zn). The set of pairwise scatterplots are shown in Figure 
4.7. Since a large number of these wire pieces have extremely small amounts of 
the various elements, a good transformation that would reveal more information 
is the log transform. We therefore depict the pairwise scatterplots of the log of 
amount of measured element in each piece of wire in Figure 4.8. The second fig- 
ure clearly reveals more information regarding the occurrence of these elements. 
Besides the rescaling, another issue that we addressed was the large number of 
missing values. In this case, a missing value could probably be interpreted as 
zero occurence, but it would be better to assign the particular observation a 
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Figure 4.7: Pairwise scatterplots of wire elements 
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Figure 4.8: Pairwise scatterplots of log(wire elements) 
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weight of zero, so that it would never affect any of the computed statistics. 

We accordingly apphed our algorithm to the scaled data, i.e. to log(amount 
of trace element measured in piece of wire). The scaled robust Mahalanobis 
distances, together with the estimated density function, are shown in Figure 4.9. 
The rejection value is also shown, as a horizontal hne in the Mahalanobis distance 
graph and as a vertical line in the density graph. It appears that there is a group 
of observations containing none or almost none of these elements (this group 
contains several subdivisions), and another group of wire pieces that contain 
at least a certain amount of elements. There is a large separation between 
the two groups and the algorithm does not have any difficulty determining a 
suitable rejection line. The robust location estimate is 0, so points with large 
Mahalanobis distances arc those that do contain measurable amounts of these 
elements. It is more interesting to analyze wire pieces that fall into the latter 
category, so in Figure 4.10 we plot the Mahalanobis distances with correposnding 
density of only those points with Mahalanobis distance greater than the cutoff 
value. There is some interesting structure in this group, with a very low density 
peak towards the left boundary of the graph, with another medium peak followed 
by the highest peak on the graph. It would be interesting to investigate the 
reason for the three peaks and why the occurrence of these elements is not 
Gaussian, though such an investigation would require a much more thorough 
knowledge of the data set than we currently have available. Nevertheless, our 
analysis has highlighted the presence of several groups or clusters within this 
data set, and with further research, indicates that it might be possible to use 
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our method in certain types of clustering algorithms or discrimation analysis. 

4.7 Computational Time 

A very important aspect of outlier identification algorithms is the time taken 
to produce an answer. The ability to find the correct answer is meaningless if it 
takes orders of magnitude longer than other comparable methods. We therefore 
measured computation times for a range of n and p values for BACON*, Kurto- 
sisl* and our method. The specific computer we used was skylla.cudenver.edu, 
a Pentium 4 system using Redhat Linux 7.3, with a speed of 2.2 Ghz and 2GB 
RAM. We utilized the Fortran command etime to measure the CPU time nec- 
essary to complete the algorithm. 

Computation times for our method are shown in Table 4.57. Note that 
the bottom right entry (n = 1, 000, 000, p = 100) had to be run using sin- 
gle precision, because double precision array sizes required more memory than 
was available and the variable declaration statements at the beginning of the 
program produced a core dump. (We created a single precision version of our 
program, and verified that single and double precision took approximately the 
same computational time for other table entries.) 

With the exception of very small data sets, computational times for our 
method increase linearly with n and also with p, for p < 50. It is not reason- 
able to expect an exactly linear increase in computation time for all p, because 
all distance-based methods require estimation of the covariance matrix which 
contains roughly unique entries, specifically, siP±ll_ jf calculation of the co- 
variance matrix is the only portion of the algorithm which increases nonlinearly 
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Figure 4.9: Robust Mahalanobis distances for all wire elements 
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Figure 4.10: Robust Mahalanobis distances for non-zero wire elements 
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Table 4.57: Computation times for our algorithm, in seconds 





p — 5 


p = 10 


p = 20 


50 


p = 100 


100 


0.00097 


0.00169 


0.00355 


0.0124 


0.0394 


n= 1,000 


0.00765 


0.0146 


0.032 


0.119 


0.37 


n = 10,000 


0.076 


0.15 


0.324 


1.19 


3.71 


n = 100, 000 


0.77 


1.51 


3.34 


12.2 


37.9 


n = 1, 000, 000 


7.78 


15.05 


33.66 


122.8 


392 



in p, as with our method, the overall computational time will increase approxi- 
mately linearly for small p and with a small coefficient of for larger p. It is not 
possible to obtain a covariance estimate any faster than the sample covariance, 
because this consists of straightforward summations and products whereas other 
methods typically require a partial sorting or other more complex techniques. It 
is also worth mentioning that during the coding of our algorithm into Fortran, 
we paid very little attention to specialized programming techniques designed for 
rapid execution. There are almost certainly a number of routines which could 
accelerate the algorithm if special attention was paid to optimizing the Fortran 
code. However, this was not the primary purpose of our investigation, so we will 
leave that for future work. 

A portion of the measured computing time is used to generate the data. We 
generated Normal data for these simulations because this is the simplest data 
to generate, besides the Uniform distribution. Generating data, for instance, 
is based on random Gamma deviates, for which we (as previously described) 
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utilize a nontrivial, modified rejection algorithm described in [2]. For example, 
generating ^ data instead of Normal data leads to a computation time of 129 
seconds vs. 122.8 seconds for n = 1, 000, 000 and p = 50 (roughly 5% of the 
total computation time). 

Table 4.58 shows computation times for BACON* obtained under the same 
circumstances. We note that for small n and p, BACON* is actually faster than 
our method. It demonstrates a slightly more than linear increase of computa- 
tion time, however, so that for large n and p it is slower than our method by 
approximately a factor of 1.5-1.7. This is not a big difference, but could become 
important for data sets of size 1,000,000 x 100 and larger. For extension to 
massive data sets, the trends developed in Tables 4.57 and 4.58 are most im- 
portant. For large data sets, BACON* slows down considerably more than our 
method. 

Computation times for Kurtosisl* are shown in Table 4.59. The unfilled 
entries are indicated by a dash because the expected computation time exceeds 
one hour. Existing entries in this table increase almost exactly with p^, and by 
a factor slightly greater than hnear with n. It is evident that Kurtosisl* does 
not extend well at all to large data sets, for either n or p. It can hardly be called 
feasible for any data set of size 10,000 x 20 and larger. As with BACON*, I 
do not claim that my particular Fortran implementation of Pena and Prieto's 
algorithm [60] is the most efficient (it almost certainly is not) but since the 
same programmer [MW] coded all three algorithms — BACON*, Kurtosisl* 
and our method — the efficiency of the Fortran code should be roughly the 
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Table 4.58: Computation times for BACON*, in seconds 





p — 5 


p = 10 


p = 20 


p= 50 


p = 100 


n= 100 


0.00063 


0.00134 


0.00164 


0.0091 


0.025 


n = 1,000 


0.0084 


0.015 


0.029 


0.115 


0.54 


ri = 10.000 


0.10 


0.153 


0.31 


1.27 


3.91 


n = 100, 000 


1.15 


1.98 


3.98 


16.55 


92.5 


n^l, 000, 000 


19.8 


25.5 


57.7 


187 


624 



Table 4.59: Computation times for Kurtosisl*, in seconds 





p = 5 


p= 10 


p = 20 


p = 50 


p = 100 


n= 100 


0.044 


0.11 


0.51 


5.3 


64 


n = 1,000 


1.2 


2.0 


9.0 


107 


690 


n = 10, 000 


17 


31 


182 


2,090 




n = 100, 000 


280 


493 


2,540 






n = 1,000,000 


3,160 
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same in all three cases. For a researcher to consider Kurtosisl*, it would have 
to demonstrate markedly superior outlier identification properties, which it does 
not, based on our simulation results, because for approximately the same success, 
it would be much better to use another method. 

In summary, our method shows excellent computational speed and extends 
very well to large data sets — better than any other method we examined. It is 
possible that other methods not examined could be faster, but we are confident 
that any such algorithm would not be able to match our algorithm's outlier 
success rates. We would have no hesitation recommending it for use in very large 
data sets, such as n > 1, 000, 000 and p > 100, provided the software can handle 
arrays of this size without specialized programming techniques. BACON* is 
almost as fast as our method, but is somewhat slower on large and very large 
data sets. Kurtosisl* would be useful for n < 10, 000 and p < 20, but we cannot 
recommend it for general use otherwise, unless the code could be substantially 
optimized. 

4.8 Predicting Our Model's Performance 

It would be very useful to be able to predict our model's success for a given 
situation. We therefore fit a regression model to some of the previous simulation 
results, with percent error as predicted output; two separate regressions for 
inlier and outlier errors. In real data sets, the most commonly encountered 
contamination levels are usually below 20%, hence we decided to focus on the 5% 
- 20% range of outlier occurrence. The usage of the Winsorization and MINWT 
options that provide the least error usually change around 20% contamination. 
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so we will be able to keep these parameters fixed at their optimum levels. For 
these relatively low contamination levels and moderate to light outlier overlap, 
the lowest error rates are obtained when using the combination of Winsorization 
and No MINWT. We incorporate the weight cutoff ctf as a variable, and also 
examine the two additional values 0.7 and 0.75 to investigate the effect of a 
lower cutoff (recall that previously we had considered cutoff values of 0.85 and 
0.9). We examine Chi-squared outliers drawn from xIq, Xia^ Xis- We do not 
scale the outlier mean fi^^^ as before, since our goal here is to not to compare 
performance at different dimensions, but rather to predict error rate for a given 
data set. We therefore let 5c signify the separation from the xl inhers regardless 
of dimension, and consider the values 5c = 5, 8, 10. We investigate dimension as a 
third factor in this study, with the values p = 5, 10, 20, while the fourth variable 
is percent contamination, e. We fit the model for e — 5%, 10%, 20% and all other 
factor combinations, then test this model's performance by predicting error rates 
for e = 15%. The outlier error rates for this model are presented in Tables 4.60 
and 4.61. For comparison (not for model fitting), we also display the error rates 
from previous simulations when adding the MINWT option, for a cutoff of 0.85, 
but not for e = 15%, since that wasn't previously calculated. Recall that when 
5c — 5 (very high outlier overlap), the MINWT option gives the best success 
rates, but it is interesting to specifically compare it to No MINWT and see how 
No MINWT starts yielding lower error rates as the overlap decreases. 

We depict these outlier error rates graphically in Figures 4.11, 4.12 and 4.13, 
which represent the respective cases 5c — 5, 8, 10. The trends in error rates are 



148 



Table 4.60: Outlier error rates for regression model: e = 5%, 10% 



f 


'I 


P 


ctf. 7 


ctf. 75 


ctf. 85 


ctf 9 


MNWT 


5% 


5 


5 


28.83 


27.67 


26.15 


24.36 


23.15 






10 


12.40 


11.10 


9.45 


8.35 


6.10 






20 


1.96 


1.49 


1.29 


1.01 


0.68 




8 


5 


4.25 


3.99 


3.72 


3.42 


3.15 






10 


0.27 


0.22 


0.26 


0.24 


0.19 






20 


0.0 


0.0 


0.0 


0.0 


0.0 




10 


5 


0.75 


0.67 


0.57 


0.52 


0.50 






10 


0.01 


0.0 


0.01 


0.07 


0.06 






20 


0.0 


0.0 


0.0 


0.0 


0.0 


10% 


5 


5 


34.71 


33.25 


30.75 


28.87 


26.75 






10 


17.52 


15.51 


12.35 


10.95 


7.62 






20 


4.65 


3.50 


2.49 


1.89 


0.82 




8 


5 


6.40 


5.87 


5.13 


4.67 


4.29 






10 


0.67 


0.51 


0.43 


0.37 


0.24 






20 


0.01 


0.0 


0.24 


0.30 


0.06 




10 


5 


1.19 


1.06 


0.86 


0.78 


0.78 






10 


0.02 


0.01 


0.08 


0.10 


0.07 






20 


0.0 


0.0 


0.0 


0.0 


0.0 
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Table 4.61: Outlier error rates for regression model: e = 15%, 20% 



r 


'I 


P 


ctf. 7 


ctf. 75 


ctf. 85 


ctf 9 


MNWT 


15% 


5 


5 


41.93 


40.11 


35.77 


33.66 


— 






10 


24.23 


21.31 


16.23 


14.10 


— 






20 


8.94 


6.77 


3.65 


2.55 


— 




8 


5 


9.49 


8.52 


6.85 


6.42 


— 






10 


1.82 


1.29 


0.66 


0.54 


— 






20 


0.07 


0.03 


0.0 


0.0 


— 




10 


5 


2.10 


1.79 


1.39 


1.27 


— 






10 


0.07 


0.04 


0.02 


0.01 


— 






20 


0.0 


0.0 


0.0 


0.0 


— 


20% 


5 


5 


50.80 


48.39 


43.47 


40.57 


36.56 






10 


33.94 


29.72 


21.89 


18.61 


10.88 






20 


14.67 


11.49 


6.92 


4.94 


1.41 




8 


5 


15.67 


13.32 


10.14 


9.23 


7.22 






10 


4.72 


3.37 


1.72 


1.34 


0.64 






20 


0.81 


0.36 


0.54 


0.38 


0.22 




10 


5 


4.31 


3.47 


2.49 


2.27 


1.82 






10 


0.45 


0.25 


0.13 


0.15 


0.16 






20 


0.0 


0.0 


0.22 


0.10 


0.0 
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Actual Outlier Error Rates, delta=5 




epsilon 



Figure 4.11: Outlier error rates for 5c — 5 
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Actual Outlier Error Rates, delta=8 




epsilon 



Figure 4.12: Outlier error rates for Sc — 8 
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Actual Outlier Error Rates, delta=1 0 




5 10 15 20 

epsilon 



Figure 4.13: Outlier error rates for 5c — 10 
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clearly evident, and not unexpected. The outlier error rates increase with e, 
and decrease with 5c, p and ctf. The trends for e and 5c are self-explanatory; 
we account for the trend in ctf as follows. Raising the cutoff value leads to 
less observations being included in the initial subset, so the variance estimate 
is smaller, hence the robust Mahalanobis distance for outlying points will be 
larger, resulting in increased rejection of outliers and lower error. The trend 
of decreasing error rate with increasing p was observed in all of our previous 
simulations, as well as in BACON*, but not in Kurtosisl*. It could be an 
encouraging indicator of our method's performance in high dimension. 

The inlier error rates are presented numerically in Tables 4.62 and 4.63 and 
graphically in Figures 4.14, 4.15 and 4.16. We notice from these figures that 
with the exception of 5c and p, the observed trends are opposite to those of the 
outlier error rates. This points to the intuitive fact that changing an algorithm 
parameter such as the cutoff ctf to decrease one type of error, will usually 
increase the other type of error. 

To help us decide which variables to put into the regression model and 
especially which interactions to include, we ran an ANOVA model on Splus for 
each of the two error types and the four factors p, 5c-, €,ctf. For the outlier error 
model, all the 2-way interactions were significant except ctf xp, while for the 
inlier error model, this interaction as well as e x p were not significant. 

Because the response variable (error rate) is a number between zero and one, 
after dividing by 100, the best results can be obtained from a logistic regression 
model. The logistic regression model is a generahzed linear model (GLM) [54] 
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Table 4.62: Inlier error rates for regression model: e = 5%, 10% 



r 


(I 


P 


ctf. 7 


ctf. 75 


ctf. 85 


ctf 9 


MNAVT 


5% 


5 


5 


5.81 


5.92 


6.23 


6.99 


7.08 






10 


4.82 


4.98 


5.50 


6.31 


7.52 






20 


4.35 


4.52 


5.54 


8.63 


10.47 




8 


5 


4.62 


4.62 


4.86 


5.34 


5.44 






10 


3.35 


3.41 


3.78 


4.23 


4.94 






20 


2.62 


2.73 


3.10 


3.84 


5.71 




10 


5 


3.88 


3.91 


4.36 


4.77 


4.72 






10 


2.75 


2.81 


3.10 


3.52 


4.05 






20 


2.03 


2.31 


2.42 


2.86 


4.38 


10% 


5 


5 


4.92 


5.00 


5.31 


5.87 


6.01 






10 


4.39 


4.49 


4.98 


5.55 


6.41 






20 


4.18 


4.24 


4.99 


6.86 


7.61 




8 


5 


3.73 


3.73 


3.92 


4.33 


4.29 






10 


2.74 


2.77 


2.99 


3.38 


3.76 






20 


2.19 


2.25 


2.53 


3.18 


3.95 




10 


5 


3.08 


3.15 


3.38 


3.70 


3.55 






10 


2.15 


2.21 


2.42 


2.71 


3.07 






20 


1.66 


1.72 


1.91 


2.20 


2.93 
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Table 4.63: Inlier error rates for regression model: e = 15%, 20% 



r 


(I 


P 


ctf. 7 


ctf. 75 


ctf. 85 


ctf 9 


MNAVT 


15% 


5 


5 


4.13 


4.17 


4.59 


5.03 


- 






10 


4.10 


4.20 


4.55 


5.01 


- 






20 


4.40 


4.45 


4.54 


4.96 


- 




8 


5 


3.21 


3.19 


3.34 


3.54 


- 






10 


2.36 


2.35 


2.47 


2.69 


- 






20 


2.00 


2.01 


2.12 


2.38 


- 




10 


5 


2.38 


2.41 


2.54 


2.73 


- 






10 


1.70 


1.73 


1.87 


2.08 


- 






20 


1.41 


1.43 


1.56 


1.78 


- 


20% 


5 


5 


3.32 


3.40 


3.64 


4.04 


4.11 






10 


3.58 


3.75 


4.18 


4.61 


5.01 






20 


4.62 


4.70 


4.69 


4.91 


5.22 




8 


5 


2.72 


2.75 


2.75 


2.93 


2.83 






10 


2.42 


2.28 


2.12 


2.22 


2.15 






20 


2.07 


1.94 


1.82 


1.96 


2.09 




10 


5 


2.00 


1.99 


1.92 


2.04 


1.95 






10 


1.38 


1.34 


1.36 


1.49 


1.50 






20 


1.21 


1.20 


1.20 


1.31 


1.42 
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Actual Inlier Error Rates, delta=5 



b p=20 



Ctf=0.7 
Ctf=0.75 
Ctf=0.85 
Ctf=0.9 



p=5 

b 



3 P=''0 
^=5 




5 10 15 20 



epsilon 



Figure 4.14: Inlier error rates for Sc — 5 
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Actual Inlier Error Rates, delta=8 




Figure 4.15: Inlier error rates for Sc — 8 
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Actual Inlier Error Rates, delta=1 0 



p=5 



Ctf=0.7 
Ctf=0.75 
Ctf=0.85 
Ctf=0.9 




epsilon 



Figure 4.16: Inlier error rates for 5c — 10 
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that is specifically designed to model data for which the response is within the 
interval (0,1). Consider the classic linear regression model 



y = ;3^x + e. 



One of the main assumptions for this model is that the e's are normally dis- 
tributed with zero mean and constant variance. As discussed in [13, ch. 6], 
a GLM requires two specialized functions: (1) a link function g that describes 
how jd, the mean of y, depends on the linear predictors: g{jj) = /3"^x, and (2) 
a variance function that describes how the variance of y depends on the mean: 
var(?/) = (f)V{iJ,), with 0 constant. Link functions g{iJ,) are monotone increasing 
and hence invertible; it is often more convenient to describe the link function 
in terms of its inverse f: n — /(/3^x), or /j, — f{r)) with rj — f3^x. Logistic 
regression makes use of the logit link function. 



Substituting the link function into the basic regression model gives us the equa- 
tion 



which guarantees that y is in the interval (0,1]. We found the best results for 
our model when using transformed data, specifically arcsin (^err), log(p) and 
\og{5c), where err is the error rate expressed as a fraction between 0 and 1. For 
these transformations, the logistic regression model for outlier error is 




y 



1 + 
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axcsm{y/ out. err) = -4.89ctf - 4.76 log((5c) - 0.275 log(p) + 0.0678e 

+2mctf ■ - O.U2ctf ■ e - 0.9315^ • P + 0.03085^ • e 
+0.0273p-e + 11.00 
where out. err is the outher error rate as a fraction between 0 and 1. Similarly, 
the logistic regression model predicting the inlier error rate in.err is 

arcsin( Vi^^:erf) = 2.38ct/ + 0.874 log(5c) + 0.543 log(p) + 0.0397e 
-0.566ci/ • (5e - 0.0511ct/ • e - 0.3865^ ■ p 
-0.01844 • e - 3.48. 

In Figures 4.17 and 4.18 we show the log of the actual values versus the log 
of the fitted values for these models (all error rates have been scaled back to 
percentage form). Especially among the outlier errors, there are a large number 
of zeros in the actual values. There is good correspondence between the actual 
and fitted values for both inlier and outlier error rates, and the logistic regression 
model appears to be a successful fit. 

A good test for this (or any) model is its success in predicting response val- 
ues on new data. We therefore use the above models to predict error rates for 
e — 15%, and all other factor combinations. We present the actual and predicted 
values in Figures 4.19 and 4.20. It can be clearly seen that the model predictions 
are very accurate. This is an important result because it demonstrates that for 
this type of data (a mixture of x^'s), we can obtain an accurate estimate of our 
algorithm's performance on untested data sets. If the error rate is higher than 
our situation allows, we can know this in advance. Knowing the percentage 
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log(Actual Outlier Error Rates) 

Figure 4.17: Scaled actual versus fitted outlier error rates 
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log(Actual Inlier Error Rates) 



Figure 4.18: Scaled actual versus fitted infier error rates 
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of misclassifications also helps us to identify those observations that were most 
likely misclassified. For instance, if we know that approximately 5% of the des- 
ignated outliers are false positives, the misclassified points would likely be those 
designated outliers possessing the lower 5% of robust Mahalanobis distances. 
This concludes the experimental testing of our algorithm. In the final chapter 
we present some results from a theoretical analysis of it. 
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Actual Outlier Error Rates 



Figure 4.19: Actual vs. predicted outlier errors 
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Actual Inlier Error Rates 

Figure 4.20: Actual vs. predicted inlier errors 
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5. Theoretical Properties and Future Work 

One of the appeahng aspects of our algorithm is that it is affine equivari- 

ant. It calculates location and scale estimates based upon a weighted mean and 
variance. Since these weights are either zero or one, the estimates resemble a 
trimmed mean and variance, both of which are afhne equivariant. The weights 
are calculated using the biweight function, which, like other M-estimators of 
location, is not affine equivariant in its original form, but after adding a scale 
adjustment, does indeed become affine equivariant (see [35] and [38, p. 345]). 
This can be accomplished through division by a robust, affine equivariant scale 
estimate, as discussed on p. 26. Specifically, we scale the biweight estimate 
by cS, where c — 6 and S is initially the MAD and later the median absolute 
deviation from the biweight location estimate, calculated on the previous itera- 
tion. Thus, the weight function is affine equivariant, and the location and scale 
estimates on which our algorithm is based are affine equivariant, so our method 
as a whole is affine equivariant. 

As mentioned earher (p. 13), Rocke and Woodruff [66] in 1996 declared 
that all known affine equivariant methods of solving the outlier identification 
problem consisted of the following two stages: 

• Phase I: Estimate a location and shape. 

• Phase II: Calibrate the scale estimate so that it can be used to suggest 
which points are far enough from the location estimate to be considered 
outliers. 
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It is not hard to see that our method also fits into this general framework. 
During the first phase, our method uses the biweight function to obtain a set of 
weights for all np data values, and, after reassignment of these weights to either 
zero or one, results in a set of observations presumed to be outlier-free, from 
which the location and scale are estimated by means of the weighted sample 
mean and (possibly Winsorized) covariance. The second phase — calibration 
— involves computing the sample density of the robust Mahalanobis distances. 
The rejection point is determined to be the point where this sample density 
attains a value sufficiently close to zero; those observations with a Mahalanobis 
distance greater than this are considered to be outhers. The first phase of this 
method defines a robust estimator. Although we did not investigate this issue 
here, other estimators could also be used in connection with the calibration 
phase. In this thesis, the cutoff and other parameters were assigned values 
that led to increased success during outlier detection in this second stage (these 
parameters will clearly influence . Having demonstrated our algorithm's success 
in a variety of practical data situations, it would be instructive to study some 
of the theoretical properties of this robust estimator, which might prove useful 
in its own right. An appropriate tool for this purpose is the influence function. 

5.1 The Influence Function for Our Estimator 

Although weights are initially allocated to each data point by means of 

the biweight function, the reassignment of these weights to zero or one results 
in the computation of the sample mean of those points sufficiently close to 
the location estimate, while points outside the region of full acceptance are 
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not included in this computation. The latter observations (with weights equal 
to zero) nevertheless influence the estimate to a limited extent through the 
values returned by the biweight function, namely the location and scale estimates 
and more importantly, other observations' weights. The exception would be 
those points that fall outside the biweight's region of influence, namely, \x\ > 
cS, the region where the influence function of the biweight is identically zero, 
and accordingly where the observations have absolutely no effect on any of the 
returned values. 

For a; > 0, the influence function of our robust estimator will therefore con- 
sist of three piecewise linear segments, with corresponding symmetric extension 
to X < 0 such that IF(x) — —IF{—x). In a vicinity of the location estimate, 
speciflcally, that region where observations receive a weight equal to one and 
over which the sample mean is calculated, the influence function will be linear, 
with a slope close to one, resembling the sample mean. Outside of this region 
but still inside \x\ < cS, the influence function will be bounded but non-zero, 
since observations in this region still influence the calculation of the estimate 
by means of weights assigned to other observations. For \x\ > cS, the influence 
function will be zero since observations in this region are completely rejected by 
the biweight function and never exert any influence on the calculated estimates. 
Observations outside the region of full acceptance but still inside \x\ < cS have 
weights equal to zero, so their influence on the computed estimate is constant 
(but non-zero), implying a horizontal, non-zero segment. In the notation used 
on p. 24, the influence function of our estimator is therefore a special case of 
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Hampel's estimator ipa,b,r{x) with b — r — cS: ipa,r,r{x)- The exact location of 
a, the boundary points where the influence function changes from a slope close 
to one, to a non-zero horizontal segment, will depend on the cutoff^, as well as 
the speciflc distribution, since different distributions have different percentages 
of observations lying within a fixed interval of the mean. The value of r will 
depend only on the expected value of cS, which will be very similar to the ex- 
pected value of c times the MAD, and will also vary with distribution (since the 
expected value of the MAD is different for different distributions) . The value of 
r will not depend on the cutoff or any of the other parameters used by our algo- 
rithm, since it is transferred directly from Tukey's biweight function. Another 
property inherited from the biweight function is the symmetry of the infiuence 
function. Since the biweight assigns weights symmetrically on either side of x^w, 
for a symmetric distribution, an equal number of observations on each side will 
receive weights of zero or one, respectively. 

For those observations inside \x\ < r = cS, our estimator possesses similar 
characteristics to a type of adaptive trimmed mean. The a-trimmed mean is 
obtained by discarding the a% smallest and largest observations, and computing 
the sample mean of the remaining (1 — 2a) % observations. As discussed in 
Schervish [74, pp. 314-315], the a-trimmed mean can be viewed as a conditional 
mean, given that the observation falls between the a and 1 — a quantiles. That 
is. 



T{F) 



pF-i(l-Q) 
F-l(a) 



xdF{x) 



l-2a 
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for q; < |. For a — ^, this becomes the sample median, while the case a — 0 
is the regular sample mean. Although the a% smallest and largest observations 
are not included in the final calculation, they nevertheless exert an infiuence on 
the estimator because they are counted before being discarded, and cause other 
observations to be included. Adding an observation far away from the region 
of acceptance will cause another observation closer to the boundary to be in- 
cluded in this region; it doesn't matter how far away the additional observation 
is, as long as it is farther away than the observation being moved inside this 
region (that is, a constant influence function outside this region). The influence 
function of the median has zero slope for all x e 3? except at x — 0, where it 
is infinite, i.e. a discontinuity, because all observations except the middle one 
(or two, for n even) are counted but not measured. Similarly, the infiuence 
function of the trimmed mean is non-zero and horizontal outside a certain in- 
terval containing zero, specifically, outside the interval [F~^{a), F~^{1 — a)] for 
0 < q; < |, while it has positive but finite slope inside this interval because it 
resembles the mean in that region. Unlike the mean, however, the slope of the 
trimmed mean's infiuence function is Y32a' which is a consequence of the same 
denominator in its definition T{F) and shows that it obtains more information 
from the middle (1 — 2a)% observations than does the sample mean. This is 
precisely what happens with our estimator: for \x\ < a, it computes the sample 
mean, while for a < \x\ < r the observations have weight zero and exert a con- 
stant, but non-zero infiuence on the estimator. The exact value of our estimator 
is computed from the observations inside |x| < a, so it extracts more information 
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from these data points than the sample mean, and the influence function will 
have a slope greater than one. Specifically, it will have a slope of 3320' "^^^^re 
a is the percentage of observations not included in this region and like a, varies 
with the cutoff and other parameters. The influence function of our estimator 
will be horizontal and non-zero for a < \x\ < r, just like the trimmed mean, but 
will be zero for \x\ > r. The values of both a and a depend on the percentage of 
observations that are given a weight of one, so it resembles an adaptive Hampel's 
function rather than one with fixed boundary points. The value of a depends 
directly on the cutoff value as well the distribution F, and can be computed 
from the inverse of the biweight function, u{w) — yT^-y^. For instance, a 
cutoff oi w — 0.8 will cause the influence function to have positive, finite slope 
for —0.3249 < u < 0.3249, and horizontal elsewhere (still with a discontinuity at 
= r). Recall (p. 61) that u{x) is defined as where x^w is the biweight 

estimator of location and S the median absolute deviation from Xb^,- (We use 
c = 6 throughout this thesis.) The percentage of observations, 2a%, that will 
be discarded by the condition —0.3249 < u < 0.3249 will clearly depend on the 
exact distribution F. The shape of our estimator's influence function will be 
the same for all symmetric distributions, but the exact slope of the linear region 
and points of non-smoothness will depend upon the particular distribution F. 

Besides the influence function, it will also be very helpful to compute some 
of our estimator's asymptotic numerical robustness properties, which requires 
the distribution F to be specified. We will hence assume F — N{ii, a) as our 
standard reference distribution, recalling Winsor's principle that all distributions 
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are approximately Normal in the center. For an affine equi variant estimator, we 
can assume /j, — 0 and a — 1 without loss of generality. For n — 0, we have 
E{xiiyj) = 0 since the biweight is an unbiased estimator of location. Although 
it is known that E{MAD) = 0.6745(7 = 0.6745 for the N{0, 1) distribution, this 
result does not necessarily extend to the case of S (the biweight scale estimate) 
equal to the median absolute deviation from Xbw Deriving expected values 
for iterated estimates is a very complex problem, so we modifed our Fortran 
program to produce a simulated approximation. Based on 1,000,000 iterations 
of a univariate data sample of length 10,000 drawn from a N[0, 1) distribution, 
we found that £'(MAD) — E{S) — 0.674. Substituting in the expected values 
for Xbw and S, the boundary value a for full inclusion in the mean/variance 
calculation {\x\ < a) is computed as 



\u\ < 0.3249 



x-0 



< 0.3249 \x\ < 1.3149. 



6 • 0.6745 

For z ~ A^(0, 1), P{-1.315 <z< 1.315} = 2 • 0.0943; that is, 9.43% of the 
observations are expected to he on each side of the bounds ±1.3149. The slope 
of the influence function for \x\ < 1.3149 is thus i_2(o^o943) ~ 1-2324. For c = 6 
and E{S) = 0.6745, r is determined according to r = E{cS) = 6-0.6745 = 4.047. 
For cutoff=0.8 and F — N{0, 1), our estimator's influence function can thus be 
written as 
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-1.6205, 



-4.047 <x< -1.3149 



IF{x) = < 



1.2324X, 



x\ < 1.3149 



1.6205, 



4.047 >x> 1.3149 




x\ > 4.047 



We depict this influence function graphically in Figure 5.1. 

We can now easily compute some of the asymptotic robustness properties of 
this estimator at the standard Normal distribution. The gross error sensitivity 
7*, deflned as the supremum of the IF, occurs at x = 1.3149 and is equal to 
7* = 7F(1.3149) = 1.6205. This is close to the lowest possible value of 1.25 at 
this distribution, given by the sample median, and is even slightly lower than 
that of Tukey's biweight estimator, namely 1.68. The local shift sensitivity A* 
is equal to the absolute value of the maximum slope of the influence function, 
as discussed in Hoaglin et al. [38, pp. 360-361]: A* = sup^ \IF'{x)\. This is 
clearly an inflnite value, due to the discontinuity at ±4.4047. For large samples, 
an inflnite local shift sensitivity does not create many problems if one point is 
moved slightly, it can usually be replaced by another nearby point. The rejection 
point, p* = r = 4.407, a property which our estimator inherits directly from the 
biweight. We compute the asymptotic variance of this estimator according to 



V(T,F) = JZ^IF(x;F,Trf(x)dx 

= 2 J^-'''\l.232Axr^e=fdx + 2 flZi^.6205r^e-fdx + 0 
= 1.0559. 
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Figure 5.1: ^-function for our robust estimation procedure 
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so that its asymptotic efficiency relative to tfie mean at A^(0,1) is e = 10I59 ~ 
94.7%. Mosteller and Tukey [57, p. 206] state tfiat it would be "a massive job" 
in practice to distinguish between the performance of an estimator with relative 
efficiency 90% and that of the reference estimator. It is also worth noting that for 
approximately the same gross error sensitivity 7* as the biweight, this estimator 
has a lower asymptotic variance, at F = A^(0, 1). 

We repeat the same calculations for a cutoff of 0.5 (this leads to approxi- 
mately the same area under the curve for both the biweight and our proposed 
weighting function, as discussed on page 64). This will clearly include more 
values in the sample mean calculation (full inclusion) than a cutoff of 0.8. At 
the A'"(0, 1) distribution, this estimator discounts only 1.74% of the observations 
from each side. Its asymptotic variance is 1.0072, very close to the minimum 
possible variance of 1.000 at this distribution produced by the sample mean. 
Its aymptotic efficiency relative to the mean is therefore 99.3%. Compared to 
a cutoff of 0.8, this estimator has a higher gross error sensitivity of 7* = 2.187, 
which is intuitive because a lower cutoff will include more potential outliers in 
the mean/variance estimation and is less robust to their influence. 

We display some asymptotic robustness properties of these and other esti- 
mators in Table 5.1, at the standard Normal distribution. Using the notation 
from chapter 2, results for the mean, median, Huber {b — 1.5), Andrew's sine 
function (a = 1.5) and the biweight (c = 6) were obtained from Hoaglin et 
al. [38, p. 370]. We computed results for the Hampel three-part redescending 
estimator with a — 2.1, b — 3.4 and c = 8.5. (Tuning constants were selected 
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from among those values recommended by Hoaglin et al. [38] with the aim that 
where possible, these estimators should possess a similar gross error sensitivity 
7*.) With the exception of the mean, all the estimators have a breakdown point 
of 50%; the breakdown point is the only property in this table which does not 
depend on the distribution. Hoaghn et al. [38] state that the breakdown point of 
a location estimate with auxiliary scale estimate is the smaller of the breakdown 
points of the location and scale estimates. The MAD has breakdown point 50%, 
as do the median and biweight location estimates. 

Comparison of our estimator (with ctf =0.8) to the other well-known and 
frequently used estimators in Table 5.1 reveals that it is very competitive with 
regard to important robustness properties. In particular, it possesses very low 
gross error sensitivity 7* while maintaining low asymptotic variance. It was 
observed in the previous chapter that our algorithm demonstrated very good 
performance at other distributions (such as a mixture of Normals, or a mixture 
of x^'s), so it is not unreasonable to expect good robustness properties at these 
or other distributions. However, we should caution against complacency and 
there are almost certainly other estimators that could produce better results. 
We do not claim it is the best robust estimator that can be utilized for the first 
phase of our algorithm; only that it produces low error rates in outlier detection. 

It is also apparent from Table 5.1 that our algorithm with cutoff equal 
to 0.5 possesses very similar properties as the sample mean, such as very low 
variance and very low local shift sensitivity, while maintaining a finite gross error 
sensitivity and 50% breakdown point. For those situations where an estimator 
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Ta ble 5.1: Asymptotic Numerical Robustness Properties of Various Estimat ors 



Estimator 


Var. 


Rel. Eff. 


7* 


A* 


p* 


Brkdn 








(g.e.s) 


(1.S.S) 


(rej.pt) 




Mean" 


1.000 


100% 


oo 


1.000 


oo 


0 


Median" 


1.571 


63.6% 


1.25 


oo 


oo 


50% 


Huber"' 


1.037 


96.4% 


1.73 


1.15 


oo 


50% 


Hampel 


1.022 


97.8% 


1.866 


1.10 


8.5 


50% 


Andrews sine" 


1.161 


86.1% 


1.65 


1.63 


3.2 


50% 


Biweight" 


1.094 


91.4% 


1.68 


1.45 


4.05 


50% 


New, ctf ^0.8 


1.0559 


94.7% 


1.621 


oo 


4.05 


50% 


New, ctf =0.5 


1.0072 


99.3% 


2.187 


oo 


4.05 


50% 



a: Results from these estimators were obtained from [38, p. 370]. 
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similar to the mean is required, such as one with low variance, our estimator with 
cutoff=0.5 can provide a very useful alternative with much better robustness 
properties than the sample mean. 

We offer a reminder that the above results have been computed for the 
Normal distribution. If the only data encountered were Gaussian, other robust 
estimators would provide better results than ours. The point of this thesis, 
however, is to examine and identify data that are certainly not Gaussian — 
situations in which classical and certain other estimators do not perform very 
well. 

Robust estimators that might perform well during the first phase of our 
outlier algorithm would be those that provide somewhat conservative but still 
accurate, i.e., unbiased location and scale estimates across a wide variety of data 
situations. A low gross error sensitivity 7* is very important in our case. 

We should also not neglect computational speed, a critical aspect for large 
data sets. The correct answer is not very useful if it cannot be produced fast 
enough. A total execution time (i.e. both Phases I and II) of 6^ minutes for a 
data set of 1,000,000 points in 100 dimensions is faster than any other algorithm 
we examined. It is certainly possible, however, that a faster robust estimator 
could be designed without a significant loss in accuracy. Hoaglin et al. [38] note 
that the choice of estimator for a given situation can depend on many factors, 
and the best estimator for a non-Gaussian situation can often lead to subop- 
timal results on Gaussian data. For example, the computational efficiency of 
the Huber estimator might cause it to be preferable over many other estimators 
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in large data sets, or the flexibility of the Hampel three-part estimator might 
be advantageous in certain situations. These estimators were shown graphically 
in Figure 2.2 on p. 23. Our algorithm contains a great deal of inherent flexi- 
bility through several parameters such as cutoff value, Winsorization, MINWT, 
determination of final rejection point (e.g. the width of the confidence interval 
for the density slope that contains zero), etc. It also has decent computational 
speed, although further research might lead to a different weighting scheme that 
is faster to compute, for the same or similar -^-function. 

5.1.1 Examining the Accuracy of our Estimator 

To further examine the accuracy of our robust estimator, we also compare 
the resulting location and scale estimates, with the theoretical location/scale 
estimates of the inliers. We again utilize the Normal distribution, generating 
inliers from Np{0, 1), and outliers from Np{2 -1,1), Np{3 -1,1), and Np{A ■ 1, J) 
(denoted by the separation parameter Sc as 6c — 2,3,4). We consider p — 
5, 10, 20 and contamination levels of e = 5%, 10%, 20%, 30%. Instead of the 
location vector jj, and scale matrix S, though, we desire suitable scalars that 
will be easier to compare. Representative statistics can be computed from the 
mean over the p components of fi (we will call this value t), and from (we 
will call this value S). For uncorrelated situations, the off-diagonal elements of 
S should be close to zero, so the determinant will be approximately equal to 
the product of the p diagonal elements of Xl. Taking the p^^ root will therefore 
result in roughly the average variance across the p dimensions. As before, we let 
n — 10, 000 and present results calculated from the mean of 1,000 simulations. 
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Specific parameters that remained constant for all iterations were Winsorization, 
No MINWT and ctf—0.S5. We present the results numerically in Tables 5.2 and 
5.3 and graphically in Figures 5.2 and 5.3. In order to compare outlier detection 
success at the resulting t and S values, we also provide the respective inlier and 
outher error rates. Note that since our goal is to describe the inliers as accurately 
as possible without being influenced by the outliers, perfect performance would 
be given t — 0 and ^ = 1. 

Examination of these results show that for fixed levels of e and 5c, the values 
of t and S remain constant as the dimension changes; we should expect this for 
truly uncorrelated data since each dimension is independent of the others. In 
Figures 5.2 and 5.3 we therefore do not consider dimension as a variable, and plot 
the mean value obtained over the three dimensions. We also observe that for the 
lower contamination levels (e < 20%), the covariance estimate is smaller than 
I, but as the outliers become more numerous, this estimate grows larger than 
/. Comparison of the error rates to the accuracy of the location/scale estimates 
leads to the conclusion that low error rates do not depend solely on an accurate 
location and scale estimate. For instance, when e = 30% and 5c = 4, both the 
location and scale estimates are considerably larger than the theoretical values 
and larger than for any other data situation in these tables, but the error rates 
are still below 0.2% for p = 5 and below 0.01% for p = 10, 20. Additionally, 
estimates with the same degree of accuracy are produced for fixed values of Sc and 
e in different dimensions, but the error rates differ drastically. A superior robust 
estimator is therefore only one of several factors leading to improved success. For 
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high levels of contamination (e > 20%), better robust estimates would indeed 
show markedly better results, but practically, this would have to be balanced 
against the resulting higher false positives (decreased specificity) at the lower, 
and more common, contamination levels. At lower levels of contamination, we 
would expect to improve overall performance through methods other than robust 
estimation. For instance, when 5c = 2, p = 5 and e = 5%, directly substituting 
the theoretical value of S in for the estimated matrix resulted in virtually no 
improvement in error rates. When e = 30%, however, the same substitution 
resulted in error rates of 63.79/1.05, roughy 28% lower than 88.23/0.44 when 
relying upon the estimated matrix. The fact that there are still undesirably high 
error rates with a perfect covariance matrix, indicates that it might be possible 
to improve the algorithm in other areas such as possibly, the calibration stage. 

The robust estimator thus described is an M-estimator of location defined 
by the ^'-function in Figure 5.1. As such, it fits into the general framework of 
M-estimators and is described by the applicable theory. Hoaglin et al. [38] state 
that a simple reformulation of M-estimators yields a weighted mean in which 
the weights depend on the data. It is also possible to construct an M-estimator 
for a given influence function, specifying certain desirable characteristics for 
the data situation at hand. For instance, if we were dealing with a mixture of 
Gaussians (for example, shift outliers), a much simpler and quicker reformulation 
of our estimator would be to assign all points closer than 1.315(3" from (x a 
weight of one, with points further away receiving weights of zero. However, 
this reformulation would not work for data from other distributions. (Recall 
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Table 5.2: Accuracy of our robust estimator for e = 5%, 10% 



e 






f 


S 


ovf. erT%/'iv. err% 


5% 


2 


5 


0.0497 


0.669 


13.17/1.82 






10 


0.0498 


0.669 


1.40/1.35 






20 


0.0498 


0.669 


0.01/1.08 




3 


5 


0.0339 


0.687 


0.1/1.04 






10 


0.0339 


0.687 


0/0.73 






20 


0.0339 


0.686 


0/0.54 




4 


5 


0.0151 


0.682 


0/0.74 






10 


0.0152 


0.682 


0/0.48 






20 


0.0152 


0.682 


0/0.32 


10% 


2 


5 


0.108 


0.758 


17.45/1.76 






10 


0.108 


0.758 


2.86/1.32 






20 


0.108 


0.757 


0.06/1.07 




3 


5 


0.0796 


0.813 


0.17/0.85 






10 


0.0798 


0.813 


0/0.59 






20 


0.0798 


0.813 


0/0.45 




4 


5 


0.0394 


0.799 


0/0.59 






10 


0.0396 


0.799 


0/0.39 






20 


0.0396 


0.799 


0/0.27 
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Table 5.3: Accuracy of our robust estimator for e = 20%, 30% 



r 


(I 


P 


f 


S 


mif. crr%/'in. eTr% 


20% 


2 


5 


0.259 


0.981 


37.13/1.70 






10 


0.259 


0.977 


13.56/2.21 






20 


0.259 


0.972 


4.27/2.23 




3 


5 


0.221 


1.304 


0.86/0.60 






10 


0.221 


1.303 


0/0.38 






20 


0.221 


1.302 


0/0.33 




4 


5 


0.140 


1.318 


0/0.30 






10 


0.140 


1.318 


0/0.21 






20 


0.140 


1.318 


0/0.15 


30% 


2 


5 


0.465 


1.196 


88.23/0.44 






10 


0.465 


1.173 


90.20/0.30 






20 


0.465 


1.148 


85.14/6.45 




3 


5 


0.515 


2.08 


11.43/2.17 






10 


0.515 


2.03 


4.60/2.39 






20 


0.515 


1.981 


3.33/3.06 




4 


5 


0.438 


3.33 


0.19/0.13 






10 


0.438 


3.30 


0/0.01 






20 


0.438 


3.25 


0/0.01 
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Figure 5.2: Accuracy of the robust location estimate 
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Figure 5.3: Accuracy of the robust scale estimate 
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that the value 1.315 was exphcitly based on the Normal distribution; a mixture 
of skewed distributions such as x^'s would require the boundary to be moved 
considerably further away than 1.315(3".) A good question for future research 
would be to find a faster and more straightforward weighting function for the 
influence function shown in Figure 5.1. Clearly, it would be trivial to do this for a 
known distribution F, but not for unspecified distributions. If the distribution of 
the data can be determined beforehand, however (such as through previous data 
samples), our estimator could be optimized to produce significantly faster and 
more accurate results. This could be accomplished by computing the point at 
which the slope of the influence function changes from positive to zero measured 
in units of a away from /i, and directly assigning a weight of one to all points 
within this boundary and zero to points beyond it. 

Previously we mentioned that instead of using a flxed weight cutoff (such as 
0.8), we had also tried assigning a weight of one to those observations with weight 
above a certain percentile, and zero otherwise — option wpct. This method 
of weight reassignment was not utilized in this thesis because the majority of 
simulation experiments involved a wide range of contamination levels, from 5% 
to as high as 40%. If the approximate percentage of outliers can be estimated 
beforehand, however, this method will almost certainly give better results than a 
fixed weight cutoff, unless this cutoff is adjusted for the specific distribution and 
contamination level. For instance, many data sets have less than 5% outliers, so 
the robust estimator can be modified to reassign a weight of one to those points 
in the upper 95% of weights. 
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The estimator discussed here was intended for use in the outUer detection 
algorithm, but it is possible that there might be other useful applications in 
addition, such as iteratively reweighted least squares or general statistical models 
that require some type of robust estimation. Knowledge of some of the numerical 
robustness properties listed in Table 5.1 becomes critical in deciding where this 
estimator could prove beneficial. Its inherent flexibility makes it very adaptable 
to the specific needs of a particular application. For instance, an estimator 
might be needed that has high efficiency; in this case some of our estimator's 
parameters can be tuned to increase the efficiency to the desired level and no 
higher — an adjustment which can be performed on only a few other estimators. 

5.2 The Calibration Stage 

Computation of the final outlier rejection point doesn't require any assump- 
tions about the underlying data type or distribution of robust Mahalanobis dis- 
tances. This has both advantages and disadvantages. For instance, even though 
we used a computationally fast method of estimating the sample density (Sil- 
verman's kernel density estimation incorporating the FFT [76]), there might be 
faster methods of determining a rejection point. Hadi et al. [32] develop several 
correction factors to adjust the xl percentile used as an outlier rejection point; 
these corrections do not work very well in several realistic situations, however. 
It might be possible to find other correction factors, if indeed the exact condi- 
tions can be established under which the distribution of Mahalanobis distances 
and modifications thereof follow a or F distribution — this would reduce the 
computation time considerably. More knowledge concerning the distribution of 
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robust Mahalanobis distances would therefore be very valuable. 

We did not investigate the effect of smoothing the density estimate or its 
slope. Especially for smaller data sets, a more accurate sample density could 
likely be obtained after the application of an appropriate smoothing algorithm. 
Visual inspection of some of the sample density curves lead us to believe that 
it would probably be better to employ an algorithm with a fairly large smooth- 
ing window, as we noticed that some of the curves exhibited departures from 
monotonicity lasting more than just a few data points. Estimating a function's 
derivative by means of first differences is also known to be somewhat noisy, and 
it could be worthwhile to investigate different smoothing techniques applied 
specifically to this slope estimate. More complex differencing techniques also 
exist; it might be beneficial to investigate some of these, bearing in mind the 
need for computational speed. 

5.3 Different Distributions 

We have tested our algorithm over a variety of data, as well as several 
challenging Gaussian configurations, with good results. It would also be im- 
portant to verify its performance on outlier situations stemming from a wider 
variety of distributions. For instance, the slash distribution, A^(0, 1) divided by 
U (0, 1) has very heavy tails. Regarding this distribution, Tukey said, "The slash 
is as unrealistic as the Gaussian, but in the opposite direction." Nevertheless 
its usefulness lies in the fact that together with the Gaussian and the one-wild 
distribution, for which n — 1 observations come from A^(0, 1) and 1 observa- 
tion from A'"(0, 100), it defines one of the three corners in distributional space. 
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Hoaglin et al. [38, ch. 11] discuss this concept of distributional space defined 
by these three distributions: the Gaussian has fight tails and gives rise to fewer 
outliers than found in real data, the Slash has heavy tails and gives rise to more 
outliers than usually found in real data, while the one- wild has tails of interme- 
diate length, is more realistic and often appropriate for modeling real data. If 
an algorithm or estimator does well at all three distributions, it will probably 
do well at the majority of distributions. Hoaglin et al. [38, ch. 11] define the 
concept of triefficiency — the minimum efficiency of an estimator at these three 
corner distribution. The triefficiency hopefully provides a lower bound of the 
efficiency of an estimator in any given symmetric distribution. Our estimator 
possesses desirable characteristics such as low variance for pure Gaussian data, 
as evidenced by Table 5.1, and produces good results for shift outliers (a more 
challenging version of the one- wild distribution) . It would therefore be good to 
test its performance on a heavy tailed distribution such as the Slash. (A mixture 
of x^'s, with considerable overlap such as 6c = 5, might provide some indication 
of our algorithm's performance with the slash distribution.) 

Investigation of the optimal parameter values at these diverse data distribu- 
tions would be instructive, and comparison to the optimal values at the Normal 
and distributions could lead to recommended values for unknown data sit- 
uations. Future work would also include testing of this algorithm on data sets 
larger than those studied in this thesis, especially with regard to computational 
speed, and also to verify that results obtained from smaller data sets extrapolate 
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to larger sets with similar outlier configurations. 

5.4 An Optimization Approach 

If certain knowledge regarding the data can be assumed, the outlier iden- 
tification problem can be analyzed from an optimization point of view. The 
most important of these assumptions are that the true means of the inlier and 
outlier distributions are known (denote these by fi^^ and /u^^J, and also that 
these distributions have the same covariance matrix E. 

For a given rejection point M, let Q{M) be the penalty function of mis- 
classifying an observation. Letting Cj„ and Ccmt be the respective penalties of 
misclassifying inliers and outliers, we can write 

Q{M) — CinP{xi is labeled outherlccj is inlier)P(a;i is inher)-|- 
CcnxtPiVi is labeled inlier|t/j is outlier)P(t/j is outlier) 

= CinP{xi is labeled outlier |a;j is inlier) (1 — e)-|- 
CoutPiVi is labeled inlier|t/j is outlier)e 

= CinP[{xi - Hi^fE-^Xi - nj > M\xi is inlier](l - e)+ 
CoutPiiVi - Hinf^'^Vi - Hin) < M\y^ is outlierje 

where P{A\B) is the conditional probability of event A, given event B. As 
before, we let e denote the percentage of outliers in the data set. For n » p, 
we can assume the scaled Mahalanobis distance follows a Xp distribution for the 
inliers and a non-central Xp(5) distribution for the outliers: 
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i^i - l^inf^'^^i - t^in) ~ Xp, given Xi ~ iV(/x^„, S) 
iVi - fJ'inf'^'Hyi - tJ'in) ~ Xlir): given ~ S) 
where r is the non-centrahty parameter, r = (/u^^^ - /u^„)^S~^(/u^i - n^J. 
Letting F^2[x) denote the cumulative distribution function (cdf) for the Xp dis- 
tribution and F^2(^)(x) the cdf for the non-central Xp(r) distribution, we can 
then write Q as 



Q{M) = - F^2(M)](1 - e) + CoutF^iir){M)e. 

Several optimization programs such as MATLAB have built-in functions for 
F^2(M) and F^2(^-)(M), but it is also possible to analyze the problem further. 
To find the minimum of Q{M), we set the derivative with respect to M equal 
to zero: 

1^ = -C,„/^2(M)(1 - e) + a.t/x?w(M)e = 0, 

where f{x) is probability distribution function (pdf) of F{x). Note that for 
distributions, the pdf 's are much easier to work with than the cdf 's. Solving the 
above equation leads to 

n _ n l\ ^\M2-V~2- I ^ ^ M?~^e 2 Y^oo (rM)'' 

= —Cini}- - e) + C'outee^ r(|)[p^ ^p^^^ + 32r(^) ^ 1 

where for the sake of convenience, we write out the first three terms of the 
sum to infinity (convergence is assured because of the presence of A;! in the 
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denominator). For p even, | will be an integer and the Gamma function reduces 
to a simple factorial: r(k) — (A; — 1) !. Otherwise, for p odd, the Gamma function 
will possess a half-integer formula and has the special form 




(p-2)!!v^ 



where A;!! is a double factorial: 



k-{k 
k\\= < k-{k 



2)...5-3-l, A;>0,odd 



2)...6-4-2, A;>0,even 



1, 



A; = -1,0 



and the first few values are r(|) = ^/^T, r(|) = Ia/tt, r(|) = |0r; 

Of course, and figy^^ are seldom known, but the procedure outlined here 
could serve as a general approach and might be be a fruitful avenue for further 
research. 

5.5 Conclusions 

In this thesis we propose a new algorithm for detecting multivariate outliers. 
Unique features of this procedure include the use of Tukey's biweight function 
to obtain an initial subset of observations presumed to be outlier free, from 
which the sample mean and covariance can be efficiently calculated, and the 
use of the sample density of robust Mahalanobis distances to determine a final 
outlier rejection boundary. This method demonstrates a high level of success 
in identifying outliers in both a mixture of minimally separated Normal and 
distributions (the latter simulates highly skewed data situations); in several 
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cases it yields the lowest error rates of any examined method. We are also able 
to predict the percentage of outlier detection error in various situations, with 
considerable accuracy. 

Computation time is a critical aspect of most modern outlier algorithms 
because many data sets are so large that the majority of existing outlier method 
would not be able to realistically complete their execution. Our algorithm is 
the fastest of all the examined methods, and since it increases at a nearly linear 
rate with increasing dimension and linear with number of observations, it has 
excellent potential for large and very large data sets. Outlier detection success 
generally also improves with increasing dimension. 

To our knowledge, the estimated density of robust Mahalanobis distances 
has not been previously used in outlier identification, and we consider this to be 
an important contribution to outlier identification. Another unique contribution 
is the particular use of Tukey's biweight function to identify an initial subset of 
outlier free-data, from which the sample mean and covariance can be estimated, 
leads to the development of a robust estimator similar to an adaptive Hampel 
estimator. 

Finally, we performed a theoretical analysis of the robust estimator utilized 
by our algorithm, and derived several important asymptotic robustness proper- 
ties. It compares well with other robust estimators and in particular, possesses a 
low gross error sensitivity and a high relative efficiency. This estimator contains 
a large amount of inherent flexibility and by means of several parameters, can 
adapt to provide very good performance at several different data situations. We 
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are of the opinion that it provides outstanding all-round performance in outlier 
detection. 
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Appendix 

A. Abbreviated Fortran code for the proposed algorithm 
PROGRAM OUTLIER 

C 

C Use the proposed method to identify outliers 
C 

INTEGER N , P , I , IDX ( 1) , MDENS , NGOOD , NSIM , N0UT2 , IDUM 
PARAMETER (N=10000 , P=5 , MDENS=2**9 , NG00D=9500 , NSIM=1000) 
REAL*8 MVDATA(N,P) ,MU(P) ,WT(N,P) ,COV(P,P) , WCTF,WINSOR(P) 
REAL*8 NOUT(P) ,MAHALA(N) , DENSITY (2 , MDENS) , SLOPE(MDENS) ,WPCT 
REAL*8 CTF,WPCT,MU2(P) 
INTEGER ERRGD (NSIM) ,ERROUT (NSIM) 

REAL CHIl , CHI2 , MNl , MN2 , SD 1 , SD2 , MN.ERRGD , MN.ERROUT 
REAL P0MN1,P0MN2,C0RMN 

PARAMETER (CTF=0.85,WPCT=0.25,CHI1=5.0,CHI2=20. ,C0RMN=10.) 
PARAMETER (MN1=0 . 0 , MN2=2 . 0 , SD1=1 . 0 , SD2=1 . 0 , P0MN1=10 . ) 
PARAMETER (P0MN2=24.) 
REAL*4 ETIME 
REAL*4 ELAPSED (2) 
REAL TOTAL 
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INTRINSIC MINLOCDABS 

C N,P - number and dimension of observations 

C MVDATA - matrix of observations; NGOOD - number of inliers 

IDUM=-1 

MVDATA (1,1) =RAN ( IDUM) 
C These two statements are necessary for initialization of PRNG 

DO 50 K=1,NSIM 

C Choose which type of data needs to be generated: 
C CALL READ.DATA (N,P, MVDATA) 

C CALL GEN.CHISQ (N,P, MVDATA, NGOOD, CHI 1,CHI2, IDUM) 

CALL GEN_NORM (N , P ,MVDATA , NGOOD ,MN1 ,MN2 , SDl , SD2 , IDUM) 
C CALL GEN_N0RM2 (N, P, MVDATA, NGOOD, MN 1 ,MN2, SDl ,SD2 , IDUM) 

C CALL GEN.MVN (N,P, MVDATA, NGOOD, IDUM) 

C CALL GEN.POIS (N,P, MVDATA, NGOOD, P0MN1,P0MN2,C0RMN, IDUM) 

C CALL WRITE.DATA (n,p,mvdata) 

DO 10 1=1, P 

CALL BIWT (MVDATA(1:N,I) ,N,MU(I) ,WT(1:N,I)) 
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10 CONTINUE 

C Choose how to reassign 0/1 weights: 
C Chopwt2 is option wpct (see text) 

CALL CHOPWT (WT , N , P , CTF , NOUT , MVDATA , WINSOR , MU) 
C CALL CH0PWT2(WT,N,P, WPCT, WCTF, NOUT, MVDATA, WINSOR) 

C Option MINWT can be switched on or off: 
C CALL MINWT (WT,N,P) 

CALL GET.COVAR (N , P , MVDATA , MU , MU2 , WT , COV , NOUT , WINSOR) 

CALL GET.MAHALA (MVDATA, N,P,MU,MU2, COV, MAHALA) 

CALL GET.DENSITY (MAHALA, DENSITY, N,MDENS, SLOPE) 
C CALL WRITE.DATA (N,P,MDENS, DENSITY, SLOPE, MAHALA, MVDATA, WT) 

CALL GET.OUTLIERS (MAHALA, DENSITY, SLOPE, N,MDENS,NGOOD, 
+ ERRGD(K) ,ERROUT(K)) 

50 CONTINUE 
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N0UT2=N-NG00D 

MN_ERRGD=SUM(ERRGD(1 : NSIM) ) /REAL (NSIM*0 . 01*NG00D) 
MN_ERROUT=SUM (ERROUT ( 1 : NSIM) ) /REAL (NSIM*0 . 01*N0UT2) 

PRINT *, MN.ERRGD, MN.ERROUT 

TOTAL = ETIME (ELAPSED) 
PRINT *, ELAPSED (1), TOTAL 

END 



SUBROUTINE GEN.CHISQ (N,P,MVDATA,NG00D,CHI1 ,CHI2, IDUM) 

INTEGER N,P,I,J 
REAL CHI1,CHI2 
REAL*8 MVDATA(N,P) 

DO 10 I=1,NG00D 
DO 5 J=1,P 

MVDATA ( I , J) =DBLE (GENCHI (CHI 1 , IDUM) ) 
5 CONTINUE 
10 CONTINUE 
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DO 20 I=(NG00D+1) ,N 
DO 15 J=1,P 

MVDATA ( I , J) =DBLE (GENCHI (CHI2 , IDUM) ) 
15 CONTINUE 
20 CONTINUE 

RETURN 
END 



SUBROUTINE GEN.NORM (N,P, MVDATA, NG00D,MN1,MN2,SD1,SD2, IDUM) 

INTEGER N,P,I, J,NGOOD,IDUM 
REAL MN1,MN2,SD1,SD2 
REAL*8 MVDATA (N,P) 

DO 10 I=1,NG00D 
DO 5 J=1,P 

MVDATA (I , J)=DBLE(GENN0R(MN1 , SDl , IDUM) ) 
5 CONTINUE 
10 CONTINUE 

DO 20 I=(NG00D+1) ,N 
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DO 15 J=1,P 

MVDATA (I , J) =DBLE (GENNOR (MN2 , SD2 , IDUM) ) 

15 CONTINUE 
20 CONTINUE 

RETURN 
END 



SUBROUTINE GEN_N0RM2 (N , P , MVDATA , NGOOD , MNl , MN2 , SDl , SD2 , IDUM) 
C This is used for the second investigation of shift outliers 

INTEGER N,P, I, J, NGOOD, IDUM 

REAL MN1,MN2,SD1,SD2 
REAL*8 MVDATA (N,P) 

DO 10 1=1, NGOOD 
DO 5 J=1,P 

MVDATA (I , J) =DBLE (GENNOR (MNl , SDl , IDUM) ) 
5 CONTINUE 
10 CONTINUE 
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DO 20 I=(NG00D+1) ,N 
MVDATAd , 1)=DBLE(GENN0R(MN2,SD2, IDUM) ) 
DO 15 J=2,P 

MVDATA ( I , J) =DBLE (GENNOR (MN 1 , SD 1 , IDUM) ) 
15 CONTINUE 
20 CONTINUE 



RETURN 

END 



SUBROUTINE GEN.POIS (N, P, MVDATA, NGOOD, POMN 1 ,P0MN2,C0RMN, 
+ IDUM) 
INTEGER N,P,I,J 
REAL P0MN1,P0MN2,C0RMN 
REAL*8 MVDATA (N, P) ,Y(N) 



DO 10 1=1, NGOOD 
DO 5 J=1,P 

MVDATA (I , J) =DBLE (POIDEV (POMNl , IDUM) ) 
5 CONTINUE 
10 CONTINUE 
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DO 20 I=(NG00D+1) ,N 
DO 15 J=1,P 

MVDATACI , J)=DBLE(P0IDEV(P0MN2 , IDUM) ) 
15 CONTINUE 
20 CONTINUE 

DO 30 1=1, N 

Yd) =DBLE (POIDEV (CORMN , IDUM) ) 

30 CONTINUE 

DO 35 J=1,P 

MVDATA ( 1 : N , J ) =MVDATA (1:N,J)+Y(1:N) 
35 CONTINUE 

RETURN 
END 



SUBROUTINE GEN.MVN (N, P, MVDATA, NGOOD, IDUM) 
INTEGER N,P,IDUM,I,J 

REAL MEANVl (P) , MEANV2 (P) , COVMX (P , P) , X (P) , WORK (P) , COVM (P , P) 
REAL PARMl (P*(P+3)/2 + 1),PARM2 (P*(P+3)/2 + 1) 
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REAL*8 MVDATA(N,P) 

DO 10 1=1, P 
MEANV1(I)=0.0 
MEANV2(I)=3.24 
C0VMX(I,I)=1.0 
DO 5 J=1,I-1 
C0VMX(I,J)=0.5 
COVMX(J,I)=COVMX(I, J) 
5 CONTINUE 
10 CONTINUE 
COVM=COVMX 

CALL SETGMN (MEANVl ,C0VM,P,PARM1) 

DO 20 I=1,NG00D 

CALL GENMN(PARM1,X,W0RK,IDUM) 
MVDATA ( I , 1 : P) =DBLE (X ( 1 : P) ) 
20 CONTINUE 

COVM=COVMX 

CALL SETGMN (MEANV2,C0VM,P,PARM2) 
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DO 30 I=(NG00D+1) ,N 

CALL GENMN (PARM2 , X , WORK , IDUM) 
MVDATA (I , 1 : P) =DBLE (X ( 1 : P) ) 
30 CONTINUE 

RETURN 
END 

C 

SUBROUTINE WRITE.DATA (N , P , M , DENSITY , SLOPE , MAHALA , MVDATA , WT) 

C This can be useful to analyze algorithm outputs in Splus 

INTEGER N,P 

REAL*8 MVDATA (N , P) , DENSITY (2 , M) , SLOPE (M) , MAHALA (N) , WT (N , P) 

OPEN (UNIT=1 , FILE= ' data . dat ' , FORM= ' FORMATTED ' , 
+ACCESS= ' SEQUENTIAL ' , STATUS= ^ old ' ) 

C DO 10 1=1, N 

C WRITE (UNIT=1,FMT=20) WT(I,1:P) 

C 10 CONTINUE 
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WRITE (UNIT=1,FMT=20) MAHALA(l:n) 

20 FORMAT (5F12.3) 

END FILE (UNIT=1) 
CLOSE (UNIT=1) 

RETURN 

END 

C 

SUBROUTINE READ.DATA (N,P,MVDATA) 

C 

C Read observations into the NxP matrix MVDATA 
C 

INTEGER N,P,I,J 
REAL*8 MVDATA (N,P) 

OPEN (UNIT=1 , FILE= ' mvt . dat ' , FORM= ' FORMATTED ' , 
+ACCESS= ' SEQUENTIAL ' , STATUS= ' OLD ' ) 
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DO 15 1=1, N 

READ (UNIT=1,*,END=99) , (MVDATA(I , J) , J=l ,P) 
15 CONTINUE 

99 CONTINUE 

CLOSE (UNIT=1) 

RETURN 
END 

C 

SUBROUTINE BIWT (X,N,LOC,WT) 

C Fit Biweight function to data vector X of length N 
C Return robust location estimate LOC and weights WT 

INTEGER N,CT,I,J 

REAL*8 X(N) ,LGC,WT(N) ,U(N) ,TEMPN(N) ,CS,TEMP1,TEMP2,C 
PARAMETER (CT=3,C=6.0) 

TEMPN=X 

CALL MEDIAN (TEMPN,N,LOC) 
WT=0.0 
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DO 50 1=1, CT 

TEMPN=DABS (X-LOC) 
CALL MEDIAN (TEMPN,N,CS) 
CS=C*CS 
U= (X-LOC) /CS 
DO 10 J=1,N 

IF (DABS(U(J)) .LT.1.0) THEN 

WT(J)=(1.0-U(J)**2.0)**2.0 
END IF 

10 CONTINUE 
TEMPN=X*WT 
TEMP1=SUM(TEMPN (1 : N) ) 
TEMP2=SUM(WT(1:N)) 
L0C=TEMP1/TEMP2 
50 CONTINUE 

RETURN 
END 

C 

SUBROUTINE CHOPWT (WT,N,P,CTF,NOUT,MVDATA,WINSOR,MU) 

C Reassign: If weight>ctf, then weight=l, else weight=0 
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INTEGER N,P,I, J,L0W,IDX(1) 

REAL*8 WT(N,P) ,NOUT(P) ,CTF,MVDATA(N,P) ,WINSOR(P) ,MU(P) 
INTRINSIC MINLOC 

DO 20 J=1,P 

IDX=MINL0C(WT(1:N,J)-CTF, (WT(1 :N , J) . GT. CTF) .AND. 
+ (MVDATA(1:N, J) .GT.MU(J))) 

C Record data value when weight=ctf : 
WINS0R(J)=MVDATA(IDX(1) , J) 

DO 10 1=1, N 
IF (WT(I, J) .GT.CTF) THEN 

WT(I,J)=1.0 
ELSE 

WT(I,J)=0.0 
END IF 
10 CONTINUE 

NOUT (J) =REAL (N) -SUM (WT ( 1 : N , J) ) 
20 CONTINUE 
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END 

C 

SUBROUTINE MINWT(WT,N,P) 

INTEGER N,P,I 
REAL*8 WT(N,P),MINW 

INTRINSIC MINVAL 

DO 10 1=1, N 
MINW=MINVAL (WT ( I , 1 : P) ) 
WT(I,1:P)=MINW 
10 CONTINUE 

RETURN 
END 

C 

SUBROUTINE CH0PWT2 (WT,N,P,WPCT,WCTF,NOUT,MVDATA,WINSOR) 

C Option wpct: let weight=l for upper wpct of weights 
C else weight=0 
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INTEGER N,P,I, J,PCT,IDX(1) 

REAL*8 WT(N,P) ,WCTF,TEMPN(N) ,NOUT(P) ,WPCT,MVDATA(N,P) 
REAL*8 WINSOR(P) 

PCT=NINT(WPCT*N) 
TEMPN=WT(1:N,1) 

C From Numerical Recipes - find obs. ranked PCT in array TEMPN: 
WCTF=SELECTION (PCT , N , TEMPN) 

c print *, wctf 

DO 20 J=1,P 

IDX=MINLOC (DABS (WT ( 1 : N , J) -WCTF) , WT( 1 : N , J) >WCTF) 
WINS0R(J)=MVDATA(IDX(1) , J) 
DO 15 1=1, N 

IF (WT(I, J) .GT.WCTF) THEN 

WT(I,J)=1.0 
ELSE 

WT(I, J)=0.0 
END IF 
15 CONTINUE 

NOUT (J) =REAL (N) -SUM (WT ( 1 : N , J) ) 
20 CONTINUE 
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RETURN 
END 

C 

SUBROUTINE GET.COVAR (N,P,MVDATA,MU,MU2,WT,C0V,N0UT,WINS0R) 

INTEGER N,P,I,J 

REAL*8 MVDATA(N,P) ,MU(P) ,WT(N,P) ,COV(P,P) ,CTR(N,P) 

REAL*8 NOUT(P) ,WINSOR(P) ,MU2(P) 

DO 10 1=1, P 

MU2(I)=SUM(MVDATA(1:N,I)*WT(1:N,I))/(SUM(WT(1:N,I))) 

CTR(1:N,I)=(MVDATA(1:N,I)-MU2(I))*WT(1:N,I) 

C If Winsorizing, the following line needs to be used 

CGV(I,I)=SUM(CTR(1:N,I)**2.0)+(N0UT(I)*(WINS0R(I)-MU2(I))**2.0) 

C If not Winsorizing, the following two lines are used instead: 

C C0V(I,I)=SUM(CTR(1:N,I)**2.0) 

C C0V(I,I)=C0V(I,I)/(SUM(WT(1:N,I)**2.0)-1.0) 

C C0V(I,I)=C0V(I,I)/(REAL(N)+N0UT(I)-1.0) 

C Sometimes we force the var. to assume its theoretical value: 
C C0V(I,I)=1.0 
10 CONTINUE 

COV=COV/ (REAL (N) -1 . 0) 



212 



DO 20 1=1, P 
DO 15 J=1,I-1 

C0V(I,J)=SUM(CTR(1:N,I)*CTR(1:N,J))/ 
+ (SUM(WT(1:N,I)*WT(1:N,J))-1.0) 
C cov(i,j)=0.0 
COV(J,I)=COV(I,J) 
15 CONTINUE 
20 CONTINUE 

C DO 30 1=1, P 

C print '(5F10.4)', (COV(I,J), J=1,P) 
C 30 CONTINUE 

RETURN 
END 

C 

SUBROUTINE GET.MAHALA (MVDATA,N,P,MU,MU2,C0V,MAHALA,WT) 

INTEGER N,P,I, J,INFO, JOB 

REAL*8 MVDATA(N,P) ,MU(P) ,COV(P,P) , INV(P,P) ,CTR(N,P) ,WT(N,P) 
REAL*8 PPMAT (P , P) , DET (2) , MAHALA (N) , TEMPP (P) , TEMP , MU2 (P) 
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MU=MU2 

C MU is the biweight location estimate 

C MU2 is the sample mean of those obs. with wt==l 

C better results are obtained using MU2, mean of inliers 

C if switching to biweight MU, also replace MU in GET.COVAR 

DO 10 J=1,P 
CTRCl : N , J) =MVDATA ( 1 : N , J) -MU( J) 
10 CONTINUE 

PPMAT=COV 
J0B=1 

C Together, DPOFA and DPODI compute matrix inverse if J0B=1 

CALL DPOFA (PPMAT,P,P, INFO) 

IF (INFO.NE.O) THEN 
print *, info 
STOP 'info <> 0 in GET.MAHALA— DPOFA ' 

END IF 

CALL DPODI (PPMAT,P,P,DET, JOB) 

DO 20 1=1, P 
DO 15 J=1,I-1 

PPMAT ( I , J) =PPMAT ( J , I ) 
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15 CONTINUE 



20 CONTINUE 

MAHALA=0 . 0 
DO 30 1=1, N 
TEMPP=MATMUL(CTR(I , 1 :P) ,PPMAT) 
DO 25 J=1,P 

MAHALA ( I ) =MAHALA ( I ) +TEMPP ( J ) * CTR ( I , J ) 
25 CONTINUE 
30 CONTINUE 

END 

C 

SUBROUTINE GET.DENSITY (MAHALA, DENSITY, N,M, SLOPE) 

C Compute estimated density of MAHALA with Silverman's EFT 

INTEGER N,M, LOW, HIGH 

REAL*8 MAHALA (N) , DENSITY (2 , M) , MEAN , STDEV , SUMSQ , TEMPN (N) 
REAL*8 WIN, BY, TEMP, PI, SLOPE (M) ,FT(M) ,DLO,DHI,IQR, SMOOTH (M) 
REAL*8 qi,Q3 
PARAMETER (PI=3 . 1415927) 
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SUMSQ=DOT_PRODUCT (MAHALA , MAHALA) 

STDEV= (SUMSQ- (SUM (MAHALA ( 1 : N) ) **2 . 0) /DBLE (N) ) / (DBLE (N) -1 . 0) 
STDEV=DSQRT(STDEV) 

L0W=NINT(0.25*N) 
HIGH=NINT(0.75*N) 
TEMPN=MAHALA 
ICAL=0 

Q1=SELECTI0N (LOW , N , TEMPN) 
Q3=SELECTI0N (HIGH , N , TEMPN) 
IQR=Q3-Q1 

WIN=MIN(STDEV, (IQR/1.34)) 
WIN= (0 . 9*WIN) / (N**0 . 2) 

DL0=1.0 

DHI=0 . 75*MAXVAL (MAHALA ( 1 : N) ) 
BY= (DHI-DLO) /DBLE (M) 
DO 10 K=1,M 
DENSITY ( 1 , K) =DBLE (DLO+ (REAL (K) -0 . 5) *BY) 
10 CONTINUE 
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CALL DENEST (TEMPN , N , DLO , DHI , WIN , FT , SMOOTH , M , ICAL , IFAULT) 
DENSITY (2 , 1 : M) =SMOOTH 

SLOPE (2 : M)=DENSITY(2 , 2 : M) -DENSITY (2 , 1 : (M-1) ) 
SL0PE(1)=SL0PE(2) 

END 

C 

SUBROUTINE GET.OUTLIERS (MAHALA, DENSITY, SLOPE, N,M,NGOOD, 
+ ERRGD,ERROUT) 

INTEGER N,M,NGOOD 

REAL*8 MAHALA (N) , DENSITY (2 , M) , SLOPE (M) , CTF3 (M) 
REAL*8 DMIN,MAHAMIN 
INTEGER ERRGD,ERROUT 

DMIN=0 . 3*MAXVAL (DENSITY (2 , 1 : M) ) 
MAHAMIN=0 . 5*SUM (MAHALA ( 1 : N) ) /DBLE (N) 

C CTF3(1) is the rejection point, where the density slope is 
C sufficiently close to 0. The other conditions ensure that we 
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C obtain a point at the end of the main density peak. 



CTF3=PACK(DENSITY(1,1:M) , (DENSITY(1 , 1 :M)>MAHAMIN) .AND. 
+(DENSITY(2,1:M)<DMIN) .AND. (SL0PE(1 :M) <0 . 0) .AND. 
+(SL0PE(1:M)>-0.0001)) 

ERRGD=COUNT (MAHALA ( 1 : NGOOD) . GT . CTF3 ( 1) ) 
ERROUT=COUNT (MAHALA ( (NGOOD+1) : N) . LT . CTF3 ( 1) ) 

END 
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